Notebook Setup

Load the required packages.

options(stringsAsFactors = FALSE)
library(tidyverse)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(fastDummies)
library(scales)
library(glmnet)
library(zoo)
library(ISLR)
library(leaps)
library(rpart)
library(rpart.plot)
library(tree)
library(randomForest)
library(gbm)
# Set the ggplot theme beforehand for all plots
theme_set(theme_clean())

Introduction

This notebook was created by Cohort A Team 7 for the BA 810 team project. It describes the business setting we built around an existing dataset and our attempts at training machine learning models.

Business Problem

Our team is planning to open a store called “Everything Avocado”, selling avocado food products such as avocado toast, tortilla chips with guacamole, and avocado smoothies. While preparing to open, we want to set a budget for buying raw avocados as inventory, so we would like to learn the future price of avocados.

Fortunately, all of the team members are taking a machine learning course together, so we decided to develop a model to predict avocado prices using an existing data source.

Audience of our model

We expect that our optimal model could be applied to other raw-product retailers, especially those selling fresh produce (fruits and vegetables). Ideally, the price of any product of this type could be predicted, provided the prices are influenced by similar predictors. In a business setting, anyone who wants a sense of future product prices (retailers, customers, market analysts, etc.) would hold a strong interest in our model.


Dataset Debrief

The dataset we use is the Avocado Prices dataset from Kaggle, a record of historical avocado prices and sales volumes in multiple US markets.

ds <- read_csv("avocado.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_double(),
  Date = col_date(format = ""),
  AveragePrice = col_double(),
  `Total Volume` = col_double(),
  `4046` = col_double(),
  `4225` = col_double(),
  `4770` = col_double(),
  `Total Bags` = col_double(),
  `Small Bags` = col_double(),
  `Large Bags` = col_double(),
  `XLarge Bags` = col_double(),
  type = col_character(),
  year = col_double(),
  region = col_character()
)
glimpse(ds)
Observations: 18,249
Variables: 14
$ X1             <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
$ Date           <date> 2015-12-27, 2015-12-20, 2015-12-13, 2015…
$ AveragePrice   <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99,…
$ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, …
$ `4046`         <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48,…
$ `4225`         <dbl> 54454.85, 44638.81, 109149.67, 71976.41, …
$ `4770`         <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61…
$ `Total Bags`   <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.…
$ `Small Bags`   <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.…
$ `Large Bags`   <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127…
$ `XLarge Bags`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ type           <chr> "conventional", "conventional", "conventi…
$ year           <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015,…
$ region         <chr> "Albany", "Albany", "Albany", "Albany", "…

Here is a quick view of the columns and data types. There are 14 columns in this dataset, and we will use AveragePrice as our prediction target. X1 gives an id for each observation, and we will convert it into 1-based row numbers. Most columns are <dbl>; only 2 columns, type and region, hold character vectors, and we will convert them into dummy variables to train our models. The time span runs from January 4, 2015 to March 25, 2018.

Before going into the modeling, we made a plot for the average price of avocados grouped by date, and we observed that average price usually increases during the second half of the year.

ds %>% 
  group_by(Date) %>%
  summarise(meanpriced = mean(AveragePrice)) %>%
  ggplot(aes(x = Date, y = meanpriced)) +
  geom_point(col = "#4a7337") +
  geom_smooth(col = "#d9cd65", se = FALSE) + 
  labs(y = "Average price",
       title = "Average price by date") + 
  theme_bw()

We also plotted the total quantity sold (Total Volume) by date. We noticed that at the beginning of each year there is one day when the volume sold is extremely high. We checked the calendar and found that these days were all Super Bowl game days!

ds %>%
  group_by(Date) %>%
  summarise(totalvold = sum(`Total Volume`)) %>%
  ggplot(aes(x = Date, y = totalvold)) +
  geom_point(col = "#4a7337") +
  geom_smooth(col = "#d9cd65", se = FALSE) + 
  labs(y = "Total volume sold",
       title = "Total volume sold by date") + 
  theme_bw() 
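
As a quick check on the Super Bowl observation, the highest-volume date in each year can be pulled out with `slice_max()`. The sketch below uses a small made-up tibble standing in for the real date-level totals (the dates and volumes are hypothetical, not from the dataset):

```r
library(dplyr)

# Toy data standing in for the date-level totals (values are made up)
daily <- tibble(
  year      = c(2015, 2015, 2016, 2016),
  Date      = as.Date(c("2015-02-01", "2015-06-01", "2016-02-07", "2016-06-05")),
  totalvold = c(9e7, 6e7, 1.1e8, 7e7)
)

# One row per year: the date with the largest total volume sold
peaks <- daily %>%
  group_by(year) %>%
  slice_max(totalvold, n = 1) %>%
  ungroup()

peaks$Date
```

Running the same pattern on the real `ds` (grouped sums of `Total Volume`) would surface the dates behind the annual spikes.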


Data Preparation

In the following code chunk, we took multiple steps to arrange the columns.

# Reorder the columns and sort the rows by date
ds <- ds %>% 
  group_by(type, region) %>% 
  select(X1, year, Date, type, region, everything()) %>% 
  arrange(Date)
# Change `X1` to `ID`
colName <- names(ds)
colName[1] <- "ID"
names(ds) <- colName
# Assign distinct number id to each observation in `ID` column
ds$ID <- seq(nrow(ds))
# Extract month from `Date` column
ds$month <- month(ds$Date)
ds <- ds %>% 
  select(ID, year, month, everything())

There are only 2 types of avocados recorded, conventional and organic. We decided to split the type column into the dummy variables type_conventional and type_organic.

dsNew <- dummy_cols(ds, select_columns = "type") %>% 
  select(ID, year, month, region, type_conventional, type_organic, 
         everything(), -type)

Price Look-Up (PLU) codes are used by grocery retailers to differentiate product inventories. This dataset includes 3 avocado PLUs, and we added a new column, other_PLU, for the volume sold in products without a PLU. It is calculated by subtracting the PLU volumes from the total volume.

dsNew$other_PLU <- dsNew$`Total Volume` - dsNew$`4046` - dsNew$`4225` - dsNew$`4770`
# Reorder the columns
dsNew <- dsNew %>% 
  select(1:3, Date, everything())

We initially included other_PLU as a predictor in our models, but then we found that other_PLU is equal to Total Bags. This equivalence resolved our confusion about the bag variables: ?????
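
Because these columns are sums of decimal volumes, a tolerance-based comparison such as `all.equal()` is safer than `==` for checking the equivalence. A minimal sketch, using the first two rows shown in the glimpse above as standalone vectors:

```r
# First two rows of the volume fields, copied from the glimpse output
total_volume <- c(64236.62, 54876.98)
plu4046      <- c(1036.74,   674.28)
plu4225      <- c(54454.85, 44638.81)
plu4770      <- c(48.16,      58.33)
total_bags   <- c(8696.87,   9505.56)

# Volume not attributable to any PLU code
other_plu <- total_volume - plu4046 - plu4225 - plu4770

# TRUE when other_plu matches total_bags up to floating-point tolerance
isTRUE(all.equal(other_plu, total_bags))
```

The same check on the full data would be `isTRUE(all.equal(dsNew$other_PLU, dsNew$`Total Bags`))`.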

Categorize region into US Areas

uniqueRegion <- data.frame(region = unique(dsNew$region))
# Assign each region to a US area (in the same order as unique(dsNew$region))
uniqueRegion$Area <- c(
  "NewEngland", "Southeast", "Mideast", "RockyMountain", "NewEngland",
  "Mideast", "FarWest", "Southeast", "GreatLakes", "GreatLakes",
  "GreatLakes", "Southwest", "RockyMountain", "GreatLakes", "GreatLakes",
  "GreatLakes", "Mideast", "NewEngland", "Southeast", "GreatLakes",
  "Southeast", "FarWest", "FarWest", "Southeast", "Southeast",
  "Southeast", "Southeast", "Southeast", "Mideast", "NewEngland",
  "NewEngland", "Southeast", "Mideast", "Southwest", "Mideast",
  "Plains", "FarWest", "Southeast", "Southeast", "Southeast",
  "FarWest", "FarWest", "FarWest", "FarWest", "Southeast",
  "Southeast", "Southeast", "FarWest", "Plains", "Mideast",
  "Southeast", "TotalUS", "FarWest", "Southwest"
)
avo <- dsNew %>% 
  left_join(uniqueRegion, by = "region") %>% 
  select(1:5, Area, everything())
avo <- dummy_cols(avo, select_columns = "Area")

Rename Column Names

colnames(avo)
 [1] "ID"                 "year"               "month"             
 [4] "Date"               "region"             "Area"              
 [7] "type_conventional"  "type_organic"       "AveragePrice"      
[10] "Total Volume"       "4046"               "4225"              
[13] "4770"               "Total Bags"         "Small Bags"        
[16] "Large Bags"         "XLarge Bags"        "other_PLU"         
[19] "Area_NewEngland"    "Area_Southeast"     "Area_Mideast"      
[22] "Area_RockyMountain" "Area_FarWest"       "Area_GreatLakes"   
[25] "Area_Southwest"     "Area_Plains"        "Area_TotalUS"      
avo <- avo %>% 
  rename(TotalVolume = `Total Volume`,
         PLU4046 = `4046`,
         PLU4225 = `4225`,
         PLU4770 = `4770`,
         TotalBags = `Total Bags`,
         SmallBags = `Small Bags`,
         LargeBags = `Large Bags`,
         XLargeBags = `XLarge Bags`)

Create Train and Test Datasets

set.seed(1234)
# Time-based split: observations before March 2017 train, the rest test
avo_train <- avo %>% filter(as.Date(Date) < "2017-03-01")
# Sanity check: the training set ends in February 2017
avo_train %>%
  filter(year == 2017, month == 2)
avo_test <- avo %>% filter(as.Date(Date) >= "2017-03-01")
# Sanity check: the test set runs through March 2018
avo_test %>%
  filter(year == 2018, month == 3)
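
Because the split is by date rather than by random sampling, it can be sanity-checked by confirming the two subsets do not overlap and together cover the full data. A minimal sketch with a made-up weekly date column standing in for avo$Date:

```r
# Toy weekly date column spanning the dataset's time range
dates  <- seq(as.Date("2015-01-04"), as.Date("2018-03-25"), by = "week")
cutoff <- as.Date("2017-03-01")

train_dates <- dates[dates <  cutoff]
test_dates  <- dates[dates >= cutoff]

# No overlap, nothing lost
length(train_dates) + length(test_dates) == length(dates)
max(train_dates) < min(test_dates)
```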

Base model

f1 <- as.formula(AveragePrice ~ month + type_conventional + type_organic +
                   TotalVolume + PLU4046 + PLU4770 + PLU4225 +
                   SmallBags + LargeBags + XLargeBags +
                   Area_NewEngland + Area_Southeast +
                   Area_Mideast + Area_RockyMountain +
                   Area_FarWest + Area_GreatLakes +
                   Area_Southwest + Area_Plains + Area_TotalUS)
x1_train <- model.matrix(f1,avo_train)[,-1]
y1_train <- avo_train$AveragePrice
x1_test <- model.matrix(f1,avo_test)[,-1]
y1_test <- avo_test$AveragePrice
x1_avo <- model.matrix(f1, avo)[,-1]

Modeling

fit.lm <- lm(f1, avo_train)
fit.lm

Call:
lm(formula = f1, data = avo_train)

Coefficients:
       (Intercept)               month   type_conventional  
          1.525491            0.015248           -0.517023  
      type_organic         TotalVolume             PLU4046  
                NA           -0.229128            0.229128  
           PLU4770             PLU4225           SmallBags  
          0.229128            0.229128            0.229128  
         LargeBags          XLargeBags     Area_NewEngland  
          0.229128            0.229129            0.194460  
    Area_Southeast        Area_Mideast  Area_RockyMountain  
         -0.062017            0.167542           -0.166883  
      Area_FarWest     Area_GreatLakes      Area_Southwest  
         -0.007661           -0.065708           -0.198306  
       Area_Plains        Area_TotalUS  
          0.009257                  NA  
y.train <- avo_train$AveragePrice
y.test <- avo_test$AveragePrice
yhat.train.lm <- predict(fit.lm)
mse.train.lm <- mean((y.train - yhat.train.lm)^2)
yhat.test.lm <- predict(fit.lm, avo_test)
prediction from a rank-deficient fit may be misleading
mse.test.lm <- mean((y.test - yhat.test.lm)^2)
mse.train.lm
[1] 0.06454066
mse.test.lm
[1] 387.6257
coef(fit.lm)
       (Intercept)              month  type_conventional 
       1.525490825        0.015248443       -0.517023414 
      type_organic        TotalVolume            PLU4046 
                NA       -0.229128378        0.229128340 
           PLU4770            PLU4225          SmallBags 
       0.229128450        0.229128396        0.229128404 
         LargeBags         XLargeBags    Area_NewEngland 
       0.229128254        0.229128567        0.194459980 
    Area_Southeast       Area_Mideast Area_RockyMountain 
      -0.062016571        0.167542328       -0.166883258 
      Area_FarWest    Area_GreatLakes     Area_Southwest 
      -0.007660860       -0.065707641       -0.198305711 
       Area_Plains       Area_TotalUS 
       0.009257048                 NA 
plot(fit.lm)
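
The NA coefficients above come from perfectly collinear predictors: type_conventional + type_organic always equals 1, so lm() drops the redundant column and warns that predictions from the rank-deficient fit may be misleading. A minimal reproduction on made-up data:

```r
set.seed(1)

# Two dummy columns that always sum to 1, mirroring type_conventional/type_organic
n  <- 20
d1 <- rep(c(1, 0), n / 2)
d2 <- 1 - d1
y  <- 2 + 0.5 * d1 + rnorm(n, sd = 0.1)

fit <- lm(y ~ d1 + d2)

# lm() keeps d1 and drops the redundant d2, reporting its coefficient as NA
is.na(coef(fit)["d2"])
```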

avo1 <- avo
# Drop identifier, date, and outcome columns before stepwise selection
avo1$ID <- NULL
avo1$year <- NULL
avo1$Date <- NULL
avo1$region <- NULL
avo1$Area <- NULL
avo1$AveragePrice <- NULL
xnames <- colnames(avo1)
xnames <- xnames[!xnames %in% c("type_conventional", "type_organic", "TotalVolume",
                                "PLU4046", "PLU4770", "PLU4225",
                                "SmallBags", "LargeBags", "XLargeBags",
                                "Area_NewEngland", "Area_Southeast", "Area_Mideast",
                                "Area_RockyMountain", "Area_FarWest", "Area_GreatLakes",
                                "Area_Southwest", "Area_Plains + Area_TotalUS")]
fit_fw <- lm(AveragePrice ~ 1, data = avo_train)
yhat_train <- predict(fit_fw, avo_train)
mse_train <- mean((avo_train$AveragePrice - yhat_train) ^ 2)
yhat_test <- predict(fit_fw, avo_test)
mse_test <- mean((avo_test$AveragePrice - yhat_test) ^ 2)
log_fw <-
  tibble(
    xname = "intercept",
    model = deparse(fit_fw$call),
    mse_train = mse_train,
    mse_test = mse_test
  )
best_mse_train <- NA
best_mse_test <- NA
best_fit_fw <- NA
best_xname <- NA
for (xname in xnames) {
  # take a moment to examine and understand the following line
  fit_fw_tmp <- update(fit_fw, as.formula(paste0(" ~ ", xname)))
  # compute train MSE
  yhat_train_tmp <- predict(fit_fw_tmp, avo_train)
  mse_train_tmp <- mean((avo_train$AveragePrice - yhat_train_tmp)^2)
  # compute test MSE
  yhat_test_tmp <- predict(fit_fw_tmp, avo_test)
  mse_test_tmp <- mean((avo_test$AveragePrice - yhat_test_tmp)^2)
  # if this is the first predictor to be examined, or if this predictor
  # yields a lower MSE than the current best, store it as the current best
  if (is.na(best_mse_test) | mse_test_tmp < best_mse_test) {
    best_xname <- xname
    best_fit_fw <- fit_fw_tmp
    best_mse_train <- mse_train_tmp
    best_mse_test <- mse_test_tmp
  }
}
best_mse_train
[1] 0.147114
best_mse_test
[1] 0.1893665
log_fw <-
  log_fw %>% add_row(
    xname = best_xname,
    model = paste0(deparse(best_fit_fw$call), collapse = ""),
    mse_train = best_mse_train,
    mse_test = best_mse_test
  )
log_fw
# here is the complete forward-selection loop
xnames <- colnames(avo1)
xnames <- xnames[!xnames %in% c("type_conventional", "type_organic", "TotalVolume",
                                "PLU4046", "PLU4770", "PLU4225",
                                "SmallBags", "LargeBags", "XLargeBags",
                                "Area_NewEngland", "Area_Southeast", "Area_Mideast",
                                "Area_RockyMountain", "Area_FarWest", "Area_GreatLakes",
                                "Area_Southwest", "Area_Plains + Area_TotalUS")]
fit_fw <- lm(AveragePrice ~ 1, data = avo_train)
yhat_train <- predict(fit_fw, avo_train)
yhat_test <- predict(fit_fw, avo_test)
mse_train <- mean((avo_train$AveragePrice - yhat_train)^2)
mse_test <- mean((avo_test$AveragePrice - yhat_test)^2)
xname <- "intercept"
log_fw <-
  tibble(
    xname = xname,
    model = paste0(deparse(fit_fw$call), collapse = ""),
    mse_train = mse_train,
    mse_test = mse_test
  )
while (length(xnames) > 0) {
  best_mse_train <- NA
  best_mse_test <- NA
  best_fit_fw <- NA
  best_xname <- NA
  # select the next best predictor
  for (xname in xnames) {
    # take a moment to examine and understand the following line
    fit_fw_tmp <- update(fit_fw, as.formula(paste0(". ~ . + ", xname)))
    # compute train MSE
    yhat_train_tmp <- predict(fit_fw_tmp, avo_train)
    mse_train_tmp <- mean((avo_train$AveragePrice - yhat_train_tmp)^2)
    # compute test MSE
    yhat_test_tmp <- predict(fit_fw_tmp, avo_test)
    mse_test_tmp <- mean((avo_test$AveragePrice - yhat_test_tmp)^2)
    # if this is the first predictor to be examined, or if this predictor
    # yields a lower MSE than the current best, store it as the current best
    if (is.na(best_mse_test) | mse_test_tmp < best_mse_test) {
      best_xname <- xname
      best_fit_fw <- fit_fw_tmp
      best_mse_train <- mse_train_tmp
      best_mse_test <- mse_test_tmp
    }
  }
  log_fw <-
    log_fw %>% add_row(
      xname = best_xname,
      model = paste0(deparse(best_fit_fw$call), collapse = ""),
      mse_train = best_mse_train,
      mse_test = best_mse_test
    )
  # adopt the best model for the next iteration
  fit_fw <- best_fit_fw
  # remove the current best predictor from the list of predictors
  xnames <- xnames[xnames != best_xname]
}
prediction from a rank-deficient fit may be misleading
ggplot(log_fw, aes(seq_along(xname), mse_test)) +
  geom_point() +
  geom_line() +
  geom_point(aes(y = mse_train), color = "#AA471F") +
  geom_line(aes(y = mse_train), color = "#AA471F") +
  scale_x_continuous("Variables", labels = log_fw$xname, breaks = seq_along(log_fw$xname)) +
  scale_y_continuous("MSE test") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

fw <- as.formula(AveragePrice ~ month + TotalBags + Area_TotalUS + Area_Plains + other_PLU)
fit.fw <- lm(fw, data = avo_train)
## calculate the predicted price for the full avo dataset
yhat_avo_fw <- predict(fit.fw, avo)
prediction from a rank-deficient fit may be misleading
df1_fw <- avo %>% 
  select(Date, AveragePrice)
df2_fw <- cbind(df1_fw, yhat_avo_fw)
colnames(df2_fw)
[1] "Date"         "AveragePrice" "yhat_avo_fw" 
names(df2_fw)[3] <- "AveragePrice_hat"
## Plot the actual average price and the predictive average price
plot1_fw <- df2_fw %>% 
  group_by(Date) %>% 
  summarize(
    MeanAvg = mean(AveragePrice),
    MeanAvg_hat = mean(AveragePrice_hat)) %>% 
  ggplot() +
  geom_line(aes(Date, MeanAvg), color = "#356211") +
  geom_line(aes(Date, MeanAvg_hat), color = "#cda989") +
  theme_classic()
plot1_fw + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                      linetype = "dashed", size = 1,
                      color = "#AA471F")

regfit.bwd = regsubsets(f1, data = avo_train, nvmax = 19, method = "backward")
2  linear dependencies found
Reordering variables and trying again:
number of items to replace is not a multiple of replacement length
summary(regfit.bwd)
Subset selection object
Call: regsubsets.formula(f1, data = avo_train, nvmax = 19, method = "backward")
19 Variables  (and intercept)
                   Forced in Forced out
month                  FALSE      FALSE
type_conventional      FALSE      FALSE
TotalVolume            FALSE      FALSE
PLU4046                FALSE      FALSE
PLU4770                FALSE      FALSE
PLU4225                FALSE      FALSE
SmallBags              FALSE      FALSE
LargeBags              FALSE      FALSE
XLargeBags             FALSE      FALSE
Area_NewEngland        FALSE      FALSE
Area_Southeast         FALSE      FALSE
Area_Mideast           FALSE      FALSE
Area_RockyMountain     FALSE      FALSE
Area_FarWest           FALSE      FALSE
Area_GreatLakes        FALSE      FALSE
Area_Southwest         FALSE      FALSE
Area_Plains            FALSE      FALSE
type_organic           FALSE      FALSE
Area_TotalUS           FALSE      FALSE
1 subsets of each size up to 17
Selection Algorithm: backward
          month type_conventional type_organic TotalVolume PLU4046
1  ( 1 )  " "   "*"               " "          " "         " "    
2  ( 1 )  " "   "*"               " "          " "         " "    
3  ( 1 )  " "   "*"               " "          " "         " "    
4  ( 1 )  "*"   "*"               " "          " "         " "    
5  ( 1 )  "*"   "*"               " "          " "         " "    
6  ( 1 )  "*"   "*"               " "          " "         " "    
7  ( 1 )  "*"   "*"               " "          " "         " "    
8  ( 1 )  "*"   "*"               " "          "*"         " "    
9  ( 1 )  "*"   "*"               " "          "*"         " "    
10  ( 1 ) "*"   "*"               " "          "*"         " "    
11  ( 1 ) "*"   "*"               " "          "*"         " "    
12  ( 1 ) "*"   "*"               " "          "*"         " "    
13  ( 1 ) "*"   "*"               " "          "*"         "*"    
14  ( 1 ) "*"   "*"               " "          "*"         "*"    
15  ( 1 ) "*"   "*"               " "          "*"         "*"    
16  ( 1 ) "*"   "*"               " "          "*"         "*"    
17  ( 1 ) "*"   "*"               " "          "*"         "*"    
          PLU4770 PLU4225 SmallBags LargeBags XLargeBags
1  ( 1 )  " "     " "     " "       " "       " "       
2  ( 1 )  " "     " "     " "       " "       " "       
3  ( 1 )  " "     " "     " "       " "       " "       
4  ( 1 )  " "     " "     " "       " "       " "       
5  ( 1 )  " "     " "     " "       " "       " "       
6  ( 1 )  " "     " "     " "       " "       " "       
7  ( 1 )  " "     " "     " "       " "       " "       
8  ( 1 )  " "     " "     " "       " "       " "       
9  ( 1 )  " "     " "     " "       " "       " "       
10  ( 1 ) " "     "*"     " "       " "       " "       
11  ( 1 ) " "     "*"     "*"       " "       " "       
12  ( 1 ) "*"     "*"     "*"       " "       " "       
13  ( 1 ) "*"     "*"     "*"       " "       " "       
14  ( 1 ) "*"     "*"     "*"       " "       "*"       
15  ( 1 ) "*"     "*"     "*"       "*"       "*"       
16  ( 1 ) "*"     "*"     "*"       "*"       "*"       
17  ( 1 ) "*"     "*"     "*"       "*"       "*"       
          Area_NewEngland Area_Southeast Area_Mideast
1  ( 1 )  " "             " "            " "         
2  ( 1 )  " "             " "            "*"         
3  ( 1 )  "*"             " "            "*"         
4  ( 1 )  "*"             " "            "*"         
5  ( 1 )  "*"             " "            "*"         
6  ( 1 )  "*"             " "            "*"         
7  ( 1 )  "*"             "*"            "*"         
8  ( 1 )  "*"             "*"            "*"         
9  ( 1 )  "*"             "*"            "*"         
10  ( 1 ) "*"             "*"            "*"         
11  ( 1 ) "*"             "*"            "*"         
12  ( 1 ) "*"             "*"            "*"         
13  ( 1 ) "*"             "*"            "*"         
14  ( 1 ) "*"             "*"            "*"         
15  ( 1 ) "*"             "*"            "*"         
16  ( 1 ) "*"             "*"            "*"         
17  ( 1 ) "*"             "*"            "*"         
          Area_RockyMountain Area_FarWest Area_GreatLakes
1  ( 1 )  " "                " "          " "            
2  ( 1 )  " "                " "          " "            
3  ( 1 )  " "                " "          " "            
4  ( 1 )  " "                " "          " "            
5  ( 1 )  " "                " "          " "            
6  ( 1 )  "*"                " "          " "            
7  ( 1 )  "*"                " "          " "            
8  ( 1 )  "*"                " "          " "            
9  ( 1 )  "*"                " "          "*"            
10  ( 1 ) "*"                " "          "*"            
11  ( 1 ) "*"                " "          "*"            
12  ( 1 ) "*"                " "          "*"            
13  ( 1 ) "*"                " "          "*"            
14  ( 1 ) "*"                " "          "*"            
15  ( 1 ) "*"                " "          "*"            
16  ( 1 ) "*"                " "          "*"            
17  ( 1 ) "*"                "*"          "*"            
          Area_Southwest Area_Plains Area_TotalUS
1  ( 1 )  " "            " "         " "         
2  ( 1 )  " "            " "         " "         
3  ( 1 )  " "            " "         " "         
4  ( 1 )  " "            " "         " "         
5  ( 1 )  "*"            " "         " "         
6  ( 1 )  "*"            " "         " "         
7  ( 1 )  "*"            " "         " "         
8  ( 1 )  "*"            " "         " "         
9  ( 1 )  "*"            " "         " "         
10  ( 1 ) "*"            " "         " "         
11  ( 1 ) "*"            " "         " "         
12  ( 1 ) "*"            " "         " "         
13  ( 1 ) "*"            " "         " "         
14  ( 1 ) "*"            " "         " "         
15  ( 1 ) "*"            " "         " "         
16  ( 1 ) "*"            "*"         " "         
17  ( 1 ) "*"            "*"         " "         

Calculate the predictions and MSE for the train and test datasets; the test MSE is far too large for the model that includes all the predictors.

sub_model <- lm(f1, data = avo_train)
yhat_train_stepwise <- predict(sub_model, avo_train)
prediction from a rank-deficient fit may be misleading
MSE_train_stepwise <- mean((avo_train$AveragePrice - yhat_train_stepwise)^2)
MSE_train_stepwise
[1] 0.06454066
yhat_test_stepwise <- predict(sub_model, avo_test)
prediction from a rank-deficient fit may be misleading
MSE_test_stepwise <- mean((avo_test$AveragePrice - yhat_test_stepwise)^2)
MSE_test_stepwise
[1] 387.6257

Based on the summary, the best-fitting model contains everything except “Area_TotalUS” and “type_organic”, so we excluded these two predictors and created a new formula to train the model.

f1_1 <- as.formula(AveragePrice ~ month + type_conventional + TotalVolume +
                     PLU4046 + PLU4770 + PLU4225 +
                     SmallBags + LargeBags + XLargeBags +
                     Area_NewEngland + Area_Southeast +
                     Area_Mideast + Area_RockyMountain +
                     Area_FarWest + Area_GreatLakes +
                     Area_Southwest + Area_Plains)
regfit.bwd1 = regsubsets(f1_1, data = avo_train, nvmax = 19, method = "backward")
summary(regfit.bwd1)
Subset selection object
Call: regsubsets.formula(f1_1, data = avo_train, nvmax = 19, method = "backward")
17 Variables  (and intercept)
                   Forced in Forced out
month                  FALSE      FALSE
type_conventional      FALSE      FALSE
TotalVolume            FALSE      FALSE
PLU4046                FALSE      FALSE
PLU4770                FALSE      FALSE
PLU4225                FALSE      FALSE
SmallBags              FALSE      FALSE
LargeBags              FALSE      FALSE
XLargeBags             FALSE      FALSE
Area_NewEngland        FALSE      FALSE
Area_Southeast         FALSE      FALSE
Area_Mideast           FALSE      FALSE
Area_RockyMountain     FALSE      FALSE
Area_FarWest           FALSE      FALSE
Area_GreatLakes        FALSE      FALSE
Area_Southwest         FALSE      FALSE
Area_Plains            FALSE      FALSE
1 subsets of each size up to 17
Selection Algorithm: backward
          month type_conventional TotalVolume PLU4046 PLU4770
1  ( 1 )  " "   "*"               " "         " "     " "    
2  ( 1 )  " "   "*"               " "         " "     " "    
3  ( 1 )  " "   "*"               " "         " "     " "    
4  ( 1 )  "*"   "*"               " "         " "     " "    
5  ( 1 )  "*"   "*"               " "         " "     " "    
6  ( 1 )  "*"   "*"               " "         " "     " "    
7  ( 1 )  "*"   "*"               " "         " "     " "    
8  ( 1 )  "*"   "*"               "*"         " "     " "    
9  ( 1 )  "*"   "*"               "*"         " "     " "    
10  ( 1 ) "*"   "*"               "*"         " "     " "    
11  ( 1 ) "*"   "*"               "*"         " "     " "    
12  ( 1 ) "*"   "*"               "*"         " "     "*"    
13  ( 1 ) "*"   "*"               "*"         "*"     "*"    
14  ( 1 ) "*"   "*"               "*"         "*"     "*"    
15  ( 1 ) "*"   "*"               "*"         "*"     "*"    
16  ( 1 ) "*"   "*"               "*"         "*"     "*"    
17  ( 1 ) "*"   "*"               "*"         "*"     "*"    
          PLU4225 SmallBags LargeBags XLargeBags Area_NewEngland
1  ( 1 )  " "     " "       " "       " "        " "            
2  ( 1 )  " "     " "       " "       " "        " "            
3  ( 1 )  " "     " "       " "       " "        "*"            
4  ( 1 )  " "     " "       " "       " "        "*"            
5  ( 1 )  " "     " "       " "       " "        "*"            
6  ( 1 )  " "     " "       " "       " "        "*"            
7  ( 1 )  " "     " "       " "       " "        "*"            
8  ( 1 )  " "     " "       " "       " "        "*"            
9  ( 1 )  " "     " "       " "       " "        "*"            
10  ( 1 ) "*"     " "       " "       " "        "*"            
11  ( 1 ) "*"     "*"       " "       " "        "*"            
12  ( 1 ) "*"     "*"       " "       " "        "*"            
13  ( 1 ) "*"     "*"       " "       " "        "*"            
14  ( 1 ) "*"     "*"       " "       "*"        "*"            
15  ( 1 ) "*"     "*"       "*"       "*"        "*"            
16  ( 1 ) "*"     "*"       "*"       "*"        "*"            
17  ( 1 ) "*"     "*"       "*"       "*"        "*"            
          Area_Southeast Area_Mideast Area_RockyMountain
1  ( 1 )  " "            " "          " "               
2  ( 1 )  " "            "*"          " "               
3  ( 1 )  " "            "*"          " "               
4  ( 1 )  " "            "*"          " "               
5  ( 1 )  " "            "*"          " "               
6  ( 1 )  " "            "*"          "*"               
7  ( 1 )  "*"            "*"          "*"               
8  ( 1 )  "*"            "*"          "*"               
9  ( 1 )  "*"            "*"          "*"               
10  ( 1 ) "*"            "*"          "*"               
11  ( 1 ) "*"            "*"          "*"               
12  ( 1 ) "*"            "*"          "*"               
13  ( 1 ) "*"            "*"          "*"               
14  ( 1 ) "*"            "*"          "*"               
15  ( 1 ) "*"            "*"          "*"               
16  ( 1 ) "*"            "*"          "*"               
17  ( 1 ) "*"            "*"          "*"               
          Area_FarWest Area_GreatLakes Area_Southwest Area_Plains
1  ( 1 )  " "          " "             " "            " "        
2  ( 1 )  " "          " "             " "            " "        
3  ( 1 )  " "          " "             " "            " "        
4  ( 1 )  " "          " "             " "            " "        
5  ( 1 )  " "          " "             "*"            " "        
6  ( 1 )  " "          " "             "*"            " "        
7  ( 1 )  " "          " "             "*"            " "        
8  ( 1 )  " "          " "             "*"            " "        
9  ( 1 )  " "          "*"             "*"            " "        
10  ( 1 ) " "          "*"             "*"            " "        
11  ( 1 ) " "          "*"             "*"            " "        
12  ( 1 ) " "          "*"             "*"            " "        
13  ( 1 ) " "          "*"             "*"            " "        
14  ( 1 ) " "          "*"             "*"            " "        
15  ( 1 ) " "          "*"             "*"            " "        
16  ( 1 ) " "          "*"             "*"            "*"        
17  ( 1 ) "*"          "*"             "*"            "*"        
sub_model1 <- lm(f1_1, data = avo_train)
yhat_train_stepwise1 <- predict(sub_model1, avo_train)
MSE_train_stepwise1 <- mean((avo_train$AveragePrice - yhat_train_stepwise1)^2)
MSE_train_stepwise1
[1] 0.06454066
yhat_test_stepwise1 <- predict(sub_model1, avo_test)
MSE_test_stepwise1 <- mean((avo_test$AveragePrice - yhat_test_stepwise1)^2)
MSE_test_stepwise1
[1] 387.6257

However, the test MSE was still far too large, so we decided to remove predictors one at a time based on the summary. We tried removing “Area_Plains”, “Area_FarWest”, and “LargeBags” in turn, and found that removing “LargeBags” significantly reduced the test MSE to 0.1397455, the lowest we obtained. The best-fitting model is therefore f1_2.
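
The drop-one comparison described above can be automated by looping over candidate formulas and recording each test MSE. The sketch below does this on made-up data with hypothetical predictors x1, x2, x3 (not the avocado variables):

```r
set.seed(42)

# Made-up training and test data with three candidate predictors
n     <- 100
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
train$y <- 1 + 2 * train$x1 + 0.5 * train$x2 + rnorm(n, sd = 0.2)
test  <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
test$y <- 1 + 2 * test$x1 + 0.5 * test$x2 + rnorm(n, sd = 0.2)

# Candidate formulas: the full model and three drop-one variants
candidates <- list(
  full    = y ~ x1 + x2 + x3,
  drop_x1 = y ~ x2 + x3,
  drop_x2 = y ~ x1 + x3,
  drop_x3 = y ~ x1 + x2
)

# Fit each candidate on the training set and score it on the test set
test_mse <- sapply(candidates, function(f) {
  fit  <- lm(f, data = train)
  yhat <- predict(fit, test)
  mean((test$y - yhat)^2)
})

names(which.min(test_mse))  # name of the best-performing candidate
```

Dropping a predictor the outcome truly depends on (x1 here) inflates the test MSE, while dropping a noise variable leaves it roughly unchanged; the same comparison over f1_1's variants is what singled out “LargeBags”.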

f1_2 <- as.formula(AveragePrice ~ month + type_conventional + TotalVolume +
                     PLU4046 + PLU4770 + PLU4225 +
                     SmallBags + XLargeBags +
                     Area_NewEngland + Area_Mideast + Area_RockyMountain +
                     Area_FarWest + Area_GreatLakes +
                     Area_Southwest + Area_Plains)
regfit.bwd2 = regsubsets(f1_2, data = avo_train, nvmax = 19, method = "backward")
summary(regfit.bwd2)
Subset selection object
Call: regsubsets.formula(f1_2, data = avo_train, nvmax = 19, method = "backward")
15 Variables  (and intercept)
                   Forced in Forced out
month                  FALSE      FALSE
type_conventional      FALSE      FALSE
TotalVolume            FALSE      FALSE
PLU4046                FALSE      FALSE
PLU4770                FALSE      FALSE
PLU4225                FALSE      FALSE
SmallBags              FALSE      FALSE
XLargeBags             FALSE      FALSE
Area_NewEngland        FALSE      FALSE
Area_Mideast           FALSE      FALSE
Area_RockyMountain     FALSE      FALSE
Area_FarWest           FALSE      FALSE
Area_GreatLakes        FALSE      FALSE
Area_Southwest         FALSE      FALSE
Area_Plains            FALSE      FALSE
1 subsets of each size up to 15
Selection Algorithm: backward
          month type_conventional TotalVolume PLU4046 PLU4770
1  ( 1 )  " "   "*"               " "         " "     " "    
2  ( 1 )  " "   "*"               " "         " "     " "    
3  ( 1 )  " "   "*"               " "         " "     " "    
4  ( 1 )  "*"   "*"               " "         " "     " "    
5  ( 1 )  "*"   "*"               " "         " "     " "    
6  ( 1 )  "*"   "*"               " "         " "     " "    
7  ( 1 )  "*"   "*"               " "         " "     " "    
8  ( 1 )  "*"   "*"               "*"         " "     " "    
9  ( 1 )  "*"   "*"               "*"         " "     " "    
10  ( 1 ) "*"   "*"               "*"         " "     " "    
11  ( 1 ) "*"   "*"               "*"         " "     " "    
12  ( 1 ) "*"   "*"               "*"         " "     "*"    
13  ( 1 ) "*"   "*"               "*"         "*"     "*"    
14  ( 1 ) "*"   "*"               "*"         "*"     "*"    
15  ( 1 ) "*"   "*"               "*"         "*"     "*"    
          PLU4225 SmallBags XLargeBags Area_NewEngland
1  ( 1 )  " "     " "       " "        " "            
2  ( 1 )  " "     " "       " "        " "            
3  ( 1 )  " "     " "       " "        "*"            
4  ( 1 )  " "     " "       " "        "*"            
5  ( 1 )  " "     " "       " "        "*"            
6  ( 1 )  " "     " "       " "        "*"            
7  ( 1 )  " "     " "       " "        "*"            
8  ( 1 )  " "     " "       " "        "*"            
9  ( 1 )  "*"     " "       " "        "*"            
10  ( 1 ) "*"     " "       " "        "*"            
11  ( 1 ) "*"     "*"       " "        "*"            
12  ( 1 ) "*"     "*"       " "        "*"            
13  ( 1 ) "*"     "*"       " "        "*"            
14  ( 1 ) "*"     "*"       "*"        "*"            
15  ( 1 ) "*"     "*"       "*"        "*"            
          Area_Mideast Area_RockyMountain Area_FarWest
1  ( 1 )  " "          " "                " "         
2  ( 1 )  "*"          " "                " "         
3  ( 1 )  "*"          " "                " "         
4  ( 1 )  "*"          " "                " "         
5  ( 1 )  "*"          " "                " "         
6  ( 1 )  "*"          "*"                " "         
7  ( 1 )  "*"          "*"                "*"         
8  ( 1 )  "*"          "*"                "*"         
9  ( 1 )  "*"          "*"                "*"         
10  ( 1 ) "*"          "*"                "*"         
11  ( 1 ) "*"          "*"                "*"         
12  ( 1 ) "*"          "*"                "*"         
13  ( 1 ) "*"          "*"                "*"         
14  ( 1 ) "*"          "*"                "*"         
15  ( 1 ) "*"          "*"                "*"         
          Area_GreatLakes Area_Southwest Area_Plains
1  ( 1 )  " "             " "            " "        
2  ( 1 )  " "             " "            " "        
3  ( 1 )  " "             " "            " "        
4  ( 1 )  " "             " "            " "        
5  ( 1 )  " "             "*"            " "        
6  ( 1 )  " "             "*"            " "        
7  ( 1 )  " "             "*"            " "        
8  ( 1 )  " "             "*"            " "        
9  ( 1 )  " "             "*"            " "        
10  ( 1 ) " "             "*"            "*"        
11  ( 1 ) " "             "*"            "*"        
12  ( 1 ) " "             "*"            "*"        
13  ( 1 ) " "             "*"            "*"        
14  ( 1 ) " "             "*"            "*"        
15  ( 1 ) "*"             "*"            "*"        
sub_model2<-lm(f1_2, data = avo_train)
yhat_train_stepwise2 <- predict(sub_model2, avo_train)
MSE_train_stepwise2 <- mean((avo_train$AveragePrice - yhat_train_stepwise2)^2)
MSE_train_stepwise2
[1] 0.06473134
yhat_test_stepwise2 <- predict(sub_model2, avo_test)
MSE_test_stepwise2 <- mean((avo_test$AveragePrice - yhat_test_stepwise2)^2)
MSE_test_stepwise2
[1] 0.1397455

Calculate the predicted prices for the full avo dataset (which contains both the train and test sets) and merge them into a single data frame.

yhat_avo_avgprice <- predict(sub_model2, avo)
df1_bwd<-avo %>% 
  select(Date, AveragePrice)
df2_bwd<-cbind(df1_bwd, yhat_avo_avgprice)
colnames(df2_bwd)
[1] "Date"              "AveragePrice"      "yhat_avo_avgprice"
names(df2_bwd)[3]<-"AveragePrice_hat"

Plot the actual average price against the predicted average price. The backward-selection prediction turns out to be very similar to the ridge and lasso predictions shown later.

plot1_bwd <- df2_bwd %>% 
  group_by(Date) %>% 
  summarize(
    MeanAvg=mean(AveragePrice),
    MeanAvg_hat=mean(AveragePrice_hat))%>% 
    ggplot()+
    geom_line(aes(Date, MeanAvg),color = "#356211")+
    geom_line(aes(Date, MeanAvg_hat), color = "#cda989")+
    theme_classic()
plot1_bwd + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                   linetype = "dashed", size = 1,
                   color = "#AA471F")

Run ridge regression.

fit_ridge <- cv.glmnet(x1_train, y1_train, alpha = 0, nfolds = 100)
fit_ridge$lambda
  [1] 261.99201530 238.71736676 217.51037385 198.18735174
  [5] 180.58093365 164.53862122 149.92146362 136.60285400
  [9] 124.46743295 113.41008927 103.33504952  94.15504853
 [13]  85.79057353  78.16917543  71.22484134  64.89742274
 [17]  59.13211457  53.87898050  49.09252037  44.73127617
 [21]  40.75747288  37.13669133  33.83757002  30.83153357
 [25]  28.09254510  25.59688082  23.32292448  21.25098015
 [29]  19.36310165  17.64293707  16.07558716  14.64747631
 [33]  13.34623488  12.16059216  11.08027867  10.09593725
 [37]   9.19904201   8.38182448   7.63720629   6.95873793
 [41]   6.34054284   5.77726649   5.26403005   4.79638813
 [45]   4.37029023   3.98204569   3.62829173   3.30596429
 [49]   3.01227154   2.74466964   2.50084075   2.27867295
 [53]   2.07624193   1.89179432   1.72373252   1.57060088
 [57]   1.43107302   1.30394044   1.18810197   1.08255426
 [61]   0.98638312   0.89875557   0.81891260   0.74616266
 [65]   0.67987563   0.61947735   0.56444469   0.51430098
 [69]   0.46861190   0.42698171   0.38904983   0.35448772
 [73]   0.32299600   0.29430193   0.26815696   0.24433463
 [77]   0.22262862   0.20285090   0.18483019   0.16841038
 [81]   0.15344927   0.13981726   0.12739628   0.11607875
 [85]   0.10576663   0.09637061   0.08780931   0.08000858
 [89]   0.07290084   0.06642453   0.06052355   0.05514681
 [93]   0.05024772   0.04578385   0.04171654   0.03801056
 [97]   0.03463380   0.03155703   0.02875359   0.02619920

Predict response

yhat_train_ridge <- predict(fit_ridge, x1_train, s = fit_ridge$lambda)
yhat_test_ridge <- predict(fit_ridge, x1_test, s = fit_ridge$lambda)
# Initialize MSE vectors; they are filled per lambda in the loop below
mse_train_ridge <- vector()
mse_test_ridge <- vector()
for (i in 1:length(fit_ridge$lambda)) {
  mse_train_ridge[i] <- mean((y1_train - yhat_train_ridge)[,i]^2)
  mse_test_ridge[i] <- mean((y1_test - yhat_test_ridge)[,i]^2)
}
mse_train_ridge
  [1] 0.15012060 0.14949727 0.14943704 0.14937104 0.14929873
  [6] 0.14921953 0.14913279 0.14903783 0.14893387 0.14882010
 [11] 0.14869563 0.14855948 0.14841061 0.14824789 0.14807008
 [16] 0.14787589 0.14766389 0.14743256 0.14718028 0.14690530
 [21] 0.14660576 0.14627969 0.14592501 0.14553950 0.14512083
 [26] 0.14466658 0.14417419 0.14364104 0.14306439 0.14244146
 [31] 0.14176939 0.14104530 0.14026632 0.13942957 0.13853228
 [36] 0.13757171 0.13654533 0.13545073 0.13428578 0.13304862
 [41] 0.13173774 0.13035204 0.12889089 0.12735415 0.12574233
 [46] 0.12405657 0.12229876 0.12047153 0.11857835 0.11662358
 [51] 0.11461245 0.11255113 0.11044666 0.10830698 0.10614085
 [56] 0.10395775 0.10176778 0.09958151 0.09740983 0.09526373
 [61] 0.09315410 0.09109152 0.08908605 0.08714747 0.08528368
 [66] 0.08350210 0.08180885 0.08020874 0.07870524 0.07730049
 [71] 0.07599533 0.07478934 0.07368098 0.07266770 0.07174781
 [76] 0.07091361 0.07016229 0.06948863 0.06888726 0.06835268
 [81] 0.06787937 0.06746188 0.06709495 0.06677351 0.06649281
 [86] 0.06624837 0.06603608 0.06585213 0.06569309 0.06555584
 [91] 0.06543771 0.06533589 0.06524842 0.06517338 0.06510903
 [96] 0.06505385 0.06500653 0.06496546 0.06493110 0.06490090
mse_test_ridge
  [1] 0.1963519 0.1958176 0.1957660 0.1957095 0.1956476 0.1955798
  [7] 0.1955056 0.1954244 0.1953355 0.1952383 0.1951320 0.1950157
 [13] 0.1948887 0.1947499 0.1945983 0.1944329 0.1942524 0.1940556
 [19] 0.1938411 0.1936076 0.1933534 0.1930770 0.1927767 0.1924507
 [25] 0.1920972 0.1917141 0.1912995 0.1908514 0.1903675 0.1898458
 [31] 0.1892842 0.1886804 0.1880324 0.1873380 0.1865955 0.1858028
 [37] 0.1849584 0.1840607 0.1831086 0.1821010 0.1810374 0.1799175
 [43] 0.1787414 0.1775098 0.1762238 0.1748851 0.1734960 0.1720594
 [49] 0.1705790 0.1690590 0.1675044 0.1659210 0.1643153 0.1626943
 [55] 0.1610657 0.1594378 0.1578192 0.1562189 0.1546458 0.1531091
 [61] 0.1516174 0.1501791 0.1488020 0.1474937 0.1462592 0.1451038
 [67] 0.1440313 0.1430443 0.1421442 0.1413313 0.1406046 0.1399621
 [73] 0.1394009 0.1389173 0.1385057 0.1381629 0.1378829 0.1376603
 [79] 0.1374894 0.1373648 0.1372811 0.1372329 0.1372156 0.1372244
 [85] 0.1372551 0.1373040 0.1373676 0.1374427 0.1375268 0.1376173
 [91] 0.1377152 0.1378158 0.1379155 0.1380143 0.1381120 0.1382081
 [97] 0.1383017 0.1383898 0.1384754 0.1385589
min(mse_train_ridge)
[1] 0.0649009
min(mse_test_ridge)
[1] 0.1372156
lambda_min_mse_train_ridge<- fit_ridge$lambda[which.min(mse_train_ridge)]
lambda_min_mse_test_ridge <-fit_ridge$lambda[which.min(mse_test_ridge)]
lambda_min_mse_train_ridge
[1] 0.0261992
lambda_min_mse_test_ridge
[1] 0.1273963
yhat_train_ridge <- predict(fit_ridge, x1_train, s = lambda_min_mse_train_ridge)
yhat_test_ridge <- predict(fit_ridge, x1_test, s = lambda_min_mse_test_ridge)
yhat_1<- predict(fit_ridge, x1_avo, s = lambda_min_mse_test_ridge)

Aggregate data into one dataframe

p1<-avo %>%
  select(Date, AveragePrice)
p2<-cbind(p1, yhat_1)

Plot

p2 %>%
  group_by(Date)%>%
  summarise(meanpriced = mean(AveragePrice))
names(p2)[3] <- "pre"
p2 %>%
  group_by(Date) %>%
  summarise(meanpriced = mean(AveragePrice),meanpre = mean(pre))%>%
  ggplot()+
  geom_line(mapping = aes(x=Date,
                           y=meanpriced), col = "#356211")+
  geom_line(mapping = aes(x=Date, y= meanpre), col = "#cda989")+
  geom_vline(xintercept=as.numeric(as.Date("2017-03-01")), col = "#AA471F", linetype = "dashed") + 
  labs(title = "Ridge Regression",
       y = "Mean Price")

The ridge model behaves much the same as the lasso model, so after building the lasso model we will compare the two to see which gives a more accurate prediction for our dataset.

Run the lasso model.

lmodel <- glmnet(x1_train, y1_train, alpha = 1, nlambda = 100)
lmodel$lambda
 [1] 0.2619920153 0.2387173668 0.2175103739 0.1981873517
 [5] 0.1805809337 0.1645386212 0.1499214636 0.1366028540
 [9] 0.1244674330 0.1134100893 0.1033350495 0.0941550485
[13] 0.0857905735 0.0781691754 0.0712248413 0.0648974227
[17] 0.0591321146 0.0538789805 0.0490925204 0.0447312762
[21] 0.0407574729 0.0371366913 0.0338375700 0.0308315336
[25] 0.0280925451 0.0255968808 0.0233229245 0.0212509802
[29] 0.0193631016 0.0176429371 0.0160755872 0.0146474763
[33] 0.0133462349 0.0121605922 0.0110802787 0.0100959373
[37] 0.0091990420 0.0083818245 0.0076372063 0.0069587379
[41] 0.0063405428 0.0057772665 0.0052640301 0.0047963881
[45] 0.0043702902 0.0039820457 0.0036282917 0.0033059643
[49] 0.0030122715 0.0027446696 0.0025008407 0.0022786730
[53] 0.0020762419 0.0018917943 0.0017237325 0.0015706009
[57] 0.0014310730 0.0013039404 0.0011881020 0.0010825543
[61] 0.0009863831 0.0008987556 0.0008189126 0.0007461627
[65] 0.0006798756 0.0006194773 0.0005644447 0.0005143010
[69] 0.0004686119 0.0004269817 0.0003890498 0.0003544877
[73] 0.0003229960 0.0002943019 0.0002681570 0.0002443346

Predicting the responses shows the minimum MSE for the train set (0.06472616) and the test set (0.1390842).

y1_train_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_train)
y1_test_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_test)
mse_train = vector()
mse_test = vector()
for (i in 1:length(lmodel$lambda)) {
  mse_train[i] <- mean((y1_train - y1_train_hat)[,i]^2)
  mse_test[i] <- mean((y1_test - y1_test_hat)[,i]^2)
}
min(mse_train)
[1] 0.06472616
min(mse_test)
[1] 0.1390842

Check the minimum lambda for the train and test datasets.

lambda_min_mse_train_lasso<- lmodel$lambda[which.min(mse_train)]
lambda_min_mse_test_lasso <-lmodel$lambda[which.min(mse_test)]
lambda_min_mse_train_lasso
[1] 0.0002443346
lambda_min_mse_test_lasso
[1] 0.002500841

Use the cross-validation function to find the best lambda; the result shows that the best lambda for the training set is the same as above.

set.seed(1)
cv.out = cv.glmnet(x1_train, y1_train, alpha = 1)
## plot the lambda
plot(cv.out)

##check the best lambda
bestlam = cv.out$lambda.min
bestlam ## the best lamdba for training dataset is same as lambda_min_mse_train
[1] 0.0002443346

Check the coefficients for the f1 model. The result shows that some predictors can be eliminated: “TotalVolume”, “PLU4225”, “SmallBags”, and “Area_TotalUS”.

f1coef<-coef(lmodel, s = lambda_min_mse_test_lasso)
f1coef
20 x 1 sparse Matrix of class "dgCMatrix"
                               1
(Intercept)         1.494037e+00
month               1.455341e-02
type_conventional  -5.117866e-01
type_organic        1.250828e-11
TotalVolume         .           
PLU4046            -8.506223e-09
PLU4770             3.264729e-09
PLU4225             .           
SmallBags           .           
LargeBags          -4.707824e-08
XLargeBags          2.318309e-07
Area_NewEngland     2.294789e-01
Area_Southeast     -2.893281e-02
Area_Mideast        1.977691e-01
Area_RockyMountain -1.238285e-01
Area_FarWest        1.884770e-02
Area_GreatLakes    -2.182073e-02
Area_Southwest     -1.614236e-01
Area_Plains         2.748295e-02
Area_TotalUS        .           

Create a new formula with the remaining predictors (eliminating the uncorrelated ones).

f2 <- as.formula(AveragePrice ~ month + type_conventional + type_organic + 
                   PLU4046 + PLU4770 + LargeBags + XLargeBags + 
                   Area_NewEngland + Area_Southeast + Area_Mideast + 
                   Area_RockyMountain + Area_GreatLakes + 
                   Area_Southwest + Area_Plains)
x2_train <- model.matrix(f2,avo_train)[,-1]
x2_test <- model.matrix(f2, avo_test)[,-1]

Run the lasso model again with the new predictors.

lmodel2 <- glmnet(x2_train, y1_train, alpha = 1, nlambda = 100)

Predict response with new predictors

y2_train_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_train)
y2_test_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_test)

Compute the MSE again with the new predictors. After eliminating the uncorrelated predictors, the training MSE increases slightly (0.06482449) while the test MSE decreases slightly (0.1390709).

mse_train1 = vector()
mse_test1 = vector()
for (i in 1:length(lmodel2$lambda)) {
  mse_train1[i] <- mean((y1_train - y2_train_hat)[,i]^2)
  mse_test1[i] <- mean((y1_test - y2_test_hat)[,i]^2)
}
min(mse_train1)
[1] 0.06482449
min(mse_test1)
[1] 0.1390709

Check the new minimum lambdas for the train and test datasets.

lambda_min_mse_train1<- lmodel2$lambda[which.min(mse_train1)]
lambda_min_mse_test1 <-lmodel2$lambda[which.min(mse_test1)]
lambda_min_mse_train1
[1] 0.0003890498
lambda_min_mse_test1
[1] 0.002500841

Check the coefficients for f2; there are no more uncorrelated predictors in our model.

f2coef<-coef(lmodel2, s = lambda_min_mse_test1)
f2coef
15 x 1 sparse Matrix of class "dgCMatrix"
                               1
(Intercept)         1.512389e+00
month               1.454115e-02
type_conventional  -5.109947e-01
type_organic        1.145366e-11
PLU4046            -9.401350e-09
PLU4770             1.431715e-09
LargeBags          -4.939522e-08
XLargeBags          2.314856e-07
Area_NewEngland     2.107071e-01
Area_Southeast     -4.740335e-02
Area_Mideast        1.790446e-01
Area_RockyMountain -1.423959e-01
Area_GreatLakes    -4.033916e-02
Area_Southwest     -1.797424e-01
Area_Plains         9.054443e-03

Since the second model has the lowest test MSE, we decide to use “lmodel2” to make the final prediction.

x2_avo <- model.matrix(f2, avo)[,-1]
y2_avo_min_lambda_hat <- predict(lmodel2, s = lambda_min_mse_test1, newx = x2_avo)
class(y2_avo_min_lambda_hat)
[1] "matrix"

Aggregate the data into one data frame for the final prediction.

df1<-avo %>% 
  select(Date, AveragePrice)
df2<-cbind(df1, y2_avo_min_lambda_hat)
names(df2)[3]<-"AveragePrice_hat"

Plot the actual average price and the predictive average price

head(df2)
        Date AveragePrice AveragePrice_hat
1 2015-01-04         1.22        1.2265901
2 2015-01-04         1.00        0.9636202
3 2015-01-04         1.08        1.1943327
4 2015-01-04         1.01        0.8730412
5 2015-01-04         1.02        1.2265626
6 2015-01-04         1.40        1.1949496
class(df2$Date)
[1] "Date"
plot1 <- df2 %>% 
  group_by(Date) %>% 
  summarize(
    MeanAvg=mean(AveragePrice),
    MeanAvg_hat=mean(AveragePrice_hat)) %>%
  ggplot()+
  geom_line(aes(Date, MeanAvg),color = "#356211")+
  geom_line(aes(Date, MeanAvg_hat), color = "#cda989")+
  theme_classic()
plot1 + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                   linetype = "dashed", size = 1,
                   color = "#AA471F")

We can see that our ridge model actually did a better job than lasso. Ridge is more flexible than lasso in the sense that it shrinks coefficients without zeroing them out; perhaps every variable in our dataset contributes something, which would explain why ridge works better here.
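To make the comparison concrete, the best test MSEs reported earlier in this notebook can be collected in one place (the numeric values are copied from the outputs above):

```r
# Best test MSEs found so far: backward selection, ridge, and the refit lasso
test_mse <- c(backward = 0.1397455, ridge = 0.1372156, lasso = 0.1390709)
sort(test_mse)  # ridge has the lowest test MSE
```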

We use rpart function to build decision trees.

fit.tree <- rpart(f1,
                  avo_train,
                  control = rpart.control(cp = 0.001))
par(xpd = TRUE)
plot(fit.tree, compress=TRUE)
text(fit.tree, use.n=TRUE)

rpart.plot(fit.tree, type = 1)

yhat.train.tree <- predict(fit.tree, avo_train)
mse.train.tree <- mean((avo_train$AveragePrice - yhat.train.tree)^2)
mse.train.tree
[1] 0.03348875
yhat.test.tree <- predict(fit.tree, avo_test)
mse.test.tree <- mean((avo_test$AveragePrice - yhat.test.tree)^2)
mse.test.tree
[1] 0.1470199

The plotted tree is too difficult to read, even after we adjust the plot type. The test MSE is 0.1470199, which is greater than lasso's. Decision trees make little sense for predicting a continuous price, and we trained this one mainly for practice.
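One standard refinement we did not pursue is cost-complexity pruning. A sketch, assuming the fit.tree object fit above: rpart stores cross-validated error for each complexity parameter in cptable, so we can prune at the cp that minimizes it.

```r
# Pick the cp with the lowest cross-validated error, then prune and re-evaluate
cp_table <- fit.tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(fit.tree, cp = best_cp)
mean((avo_test$AveragePrice - predict(pruned, avo_test))^2)  # test MSE of pruned tree
```

A pruned tree is also usually much easier to read than the full one plotted above.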

We use tree function to create regression trees.

tree.avo =tree(f1, avo_train)
summary(tree.avo)

Regression tree:
tree(formula = f1, data = avo_train)
Variables actually used in tree construction:
[1] "type_organic" "PLU4046"      "Area_Mideast" "LargeBags"   
[5] "PLU4225"      "month"       
Number of terminal nodes:  9 
Residual mean deviance:  0.05779 = 704.7 / 12190 
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.93200 -0.15470 -0.01157  0.00000  0.13910  1.25200 
cv.avo =cv.tree(tree.avo)
prune.avo =prune.tree(tree.avo,best =5)
plot(prune.avo)
text(prune.avo,pretty =0)

yhat <- predict(tree.avo, newdata = avo_test)
mse_regreTree_test <- mean((yhat - avo_test$AveragePrice)^2)
mse_regreTree_test
[1] 0.1499934

The terminal nodes show the predicted price per avocado in each branch. The test MSE is 0.1499934, even greater than the rpart decision tree's.

We use the randomForest function to build bagging trees by setting mtry to the number of predictors in our base formula f1.

set.seed (1)
bag.avo =randomForest(f1, data = avo_train, 
                           mtry = 19, importance =TRUE)
bag.avo

Call:
 randomForest(formula = f1, data = avo_train, mtry = 19, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 19

          Mean of squared residuals: 0.01439899
                    % Var explained: 90.41

Then we create a scatter plot of the test predictions against the actual test prices and add a 45-degree reference line.

yhat.bag = predict (bag.avo, newdata = avo_test)
plot(yhat.bag, avo_test$AveragePrice, main = "Scatter Plot for Bagging",
     xlab = "Prediction price in test set", ylab = "Average price in test set", 
     col = "#bdcc64")
abline (0,1)

mse_bag_test <- mean((yhat.bag - avo_test$AveragePrice)^2)
mse_bag_test
[1] 0.1350875

As we can see from this graph, the points cluster above the reference line; ideally, they would lie on it. The test MSE is 0.1350875, which we treat as the baseline for the other tree models.
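Bagging also provides its own estimate of test error for free: the out-of-bag (OOB) error. A quick sanity check, assuming the bag.avo object fit above, is to compare the final OOB MSE against the held-out test MSE:

```r
# OOB MSE after all 500 trees (randomForest stores one value per tree in $mse)
bag.avo$mse[length(bag.avo$mse)]
# Held-out test MSE for comparison
mean((predict(bag.avo, avo_test) - avo_test$AveragePrice)^2)
```

Here the two estimates differ noticeably, which is consistent with our chronological train/test split: OOB error assumes the held-out rows come from the same distribution as the training rows, whereas our test period is later in time.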

Building the model first:

fit_rf <- randomForest(f1,
                       avo_train,
                       ntree=300,
                       do.trace=F)
varImpPlot(fit_rf)

Predicting our y_hats for train and for test data:

yhat_rf_train <- predict(fit_rf, avo_train)
yhat_rf_test <- predict(fit_rf, avo_test)

Calculating MSE for both train and test:

mse_rf_train <- mean((yhat_rf_train - y1_train) ^ 2)
mse_rf_test <- mean((yhat_rf_test - y1_test)^2)
print(mse_rf_train)
[1] 0.003639148
print(mse_rf_test)
[1] 0.1279945

Plot the actual and predicted prices together. First, we prepare the data by creating a data frame with Date, AveragePrice, and the predicted price (y_hat) for both train and test:

avo %>% 
  select(Date, AveragePrice) -> plot_rf_full
yhat_rf_full <- c(yhat_rf_train,yhat_rf_test )
cbind(plot_rf_full, yhat_rf_full) -> plot_rf_full1  
plot_rf_full1 %>% 
  group_by(Date) %>% 
  summarise(mean_y = mean(AveragePrice),
            mean_yhat = mean(yhat_rf_full)) %>% 
  ggplot() +
  geom_line(aes(x = Date, y = mean_y), col = "#356211") +
  geom_line(aes(x = Date, y = mean_yhat), col = "#cda989") -> rf_graph
 
rf_graph + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                   linetype = "dashed", size = 1,
                   color = "#AA471F") + 
  labs(title = "Random Forest",
       y = "Mean Price")

As we can see on the graph above, with 300 trees the random forest fits the training data very well but does noticeably worse on the test data. Nevertheless, the fit looks better than that of the less flexible models we used earlier.

We could have increased the number of trees to see whether that improves the estimates, but unfortunately the RStudio instances we used did not have enough capacity to run random forests with more than 300 trees.
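Before growing more trees, it is worth checking whether the forest has already converged. A sketch, assuming the fit_rf object from above: randomForest records the OOB MSE after each tree, so plotting that sequence shows whether error is still falling at 300 trees.

```r
# OOB MSE as a function of the number of trees; a flat tail means more trees
# would change little, so the 300-tree limit may not be costing us accuracy
plot(fit_rf$mse, type = "l",
     xlab = "Number of trees", ylab = "OOB MSE",
     main = "Random forest convergence")
```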

We use the gbm function to run boosting trees. We set the number of trees to 150, half the number used in the random forest. The interaction.depth is set to 4 and the shrinkage to 0.1. We also pass the cv.folds argument to apply k-fold cross-validation during training, splitting the training set into 10 folds.

boostingcv <- gbm(f1,
                  data = avo_train,
                  distribution = "gaussian",
                  n.trees = 150,
                  interaction.depth = 4,
                  cv.folds = 10,
                  shrinkage = 0.1)
NAs introduced by coercion (warning repeated 10 times)
relative.influence(boostingcv)
n.trees not given. Using 150 trees.
             month  type_conventional       type_organic 
        179.480280        1109.504577         953.431615 
       TotalVolume            PLU4046            PLU4770 
         97.033943         398.272468          92.155986 
           PLU4225          SmallBags          LargeBags 
        199.758316          98.467418         518.245860 
        XLargeBags    Area_NewEngland     Area_Southeast 
          9.652739          85.927019          26.551777 
      Area_Mideast Area_RockyMountain       Area_FarWest 
         99.033165          16.475180          49.161052 
   Area_GreatLakes     Area_Southwest        Area_Plains 
         15.895736          60.478064           5.685901 
      Area_TotalUS 
          1.326137 
yhat_btree <- predict(boostingcv, avo_train, n.trees = 150)
mse_btree <- mean((yhat_btree - y1_train) ^ 2)
print(mse_btree)
[1] 0.02992749

Here we get the train MSE for boosting. Then we calculate the test MSE.

yhat_btree_test <- predict(boostingcv, avo_test, n.trees = 150)
mse_btree_test <- mean((yhat_btree_test - y1_test) ^ 2)
print(mse_btree_test)
[1] 0.1293277

Next, we add the boosting predictions into train and test sets to plot the trend for predictions and actuals.

avo_train$prediction_btree <- yhat_btree
avo_test$prediction_btree <- yhat_btree_test
avo_plot <- rbind(avo_train, avo_test)
btree_plot <- avo_plot %>% 
  group_by(Date) %>% 
  summarise(meanAvg = mean(AveragePrice),
            meanAvg_hat = mean(prediction_btree)) %>% 
  ggplot() +
  geom_line(aes(Date, meanAvg), col = "#356211") + 
  geom_line(aes(Date, meanAvg_hat), col = "#cda989") + 
  labs(title = "Boosting",
       y = "Mean Price") +
  theme_clean()
btree_plot + 
  geom_vline(aes(xintercept = as.numeric(Date[113])),
             linetype = "dashed", size = 1, col = "#AA471F")

The boosting model captures the seasonality well, though its test MSE is slightly higher than the random forest's. At the end of this step, we prefer the random forest as our model of choice.
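Since boostingcv was fit with cv.folds = 10, the cross-validation results can also choose the number of trees for us instead of hard-coding 150. A sketch, assuming the boostingcv object and test set from above:

```r
# gbm.perf with method = "cv" returns the iteration with the lowest CV error
best_iter <- gbm.perf(boostingcv, method = "cv")
yhat_best <- predict(boostingcv, avo_test, n.trees = best_iter)
mean((yhat_best - avo_test$AveragePrice)^2)  # test MSE at the CV-optimal tree count
```

If the CV-optimal iteration sits at the 150-tree boundary, that would suggest the model is still improving and more trees (or a smaller shrinkage) could be tried.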

ds1 <- read_csv("avocado3.csv")
Missing column names filled in: 'X1' [1]Parsed with column specification:
cols(
  X1 = col_double(),
  Date = col_date(format = ""),
  AveragePrice = col_double(),
  `Total Volume` = col_double(),
  `4046` = col_double(),
  `4225` = col_double(),
  `4770` = col_double(),
  `Total Bags` = col_double(),
  `Small Bags` = col_double(),
  `Large Bags` = col_double(),
  `XLarge Bags` = col_double(),
  type = col_character(),
  year = col_double(),
  region = col_character(),
  NewPrice = col_double(),
  NewPrice2 = col_double()
)
glimpse(ds1)
Observations: 18,249
Variables: 16
$ X1             <dbl> 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 5…
$ Date           <date> 2015-01-04, 2015-01-04, 2015-01-04, 2015…
$ AveragePrice   <dbl> 1.22, 1.00, 1.08, 1.01, 1.02, 1.40, 0.93,…
$ `Total Volume` <dbl> 40873.28, 435021.49, 788025.06, 80034.32,…
$ `4046`         <dbl> 2819.50, 364302.39, 53987.31, 44562.12, 7…
$ `4225`         <dbl> 28287.42, 23821.16, 552906.04, 24964.23, …
$ `4770`         <dbl> 49.90, 82.15, 39995.03, 2752.35, 128.82, …
$ `Total Bags`   <dbl> 9716.46, 46815.79, 141136.68, 7755.62, 87…
$ `Small Bags`   <dbl> 9186.93, 16707.15, 137146.07, 6064.30, 87…
$ `Large Bags`   <dbl> 529.53, 30108.64, 3990.61, 1691.32, 256.2…
$ `XLarge Bags`  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 3375.…
$ type           <chr> "conventional", "conventional", "conventi…
$ year           <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015,…
$ region         <chr> "Albany", "Atlanta", "BaltimoreWashington…
$ NewPrice       <dbl> 0.00, 1.22, 1.00, 1.08, 1.01, 1.02, 1.40,…
$ NewPrice2      <dbl> 0.00, 0.00, 1.22, 1.00, 1.08, 1.01, 1.02,…

The new dataset, avocado3, is one we created from the original avocado dataset. We added two columns, “NewPrice” and “NewPrice2”, derived from the existing “AveragePrice”: “NewPrice” takes the AveragePrice from one row above, and “NewPrice2” takes it from two rows above. We add these two lagged columns as new variables to make a more accurate prediction on our time-series dataset.
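The same lag columns could also be built directly in R rather than in the CSV. A hedged sketch using dplyr::lag: here we lag within each type/region series ordered by Date (rather than across raw CSV rows), which is arguably closer to a true time-series lag; the default of 0 mirrors the zeros visible in the glimpse output above.

```r
# Build one-step and two-step lags of AveragePrice within each type/region series
ds1_lagged <- ds1 %>%
  group_by(type, region) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(NewPrice  = lag(AveragePrice, 1, default = 0),
         NewPrice2 = lag(AveragePrice, 2, default = 0)) %>%
  ungroup()
```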

ds1 <- ds1 %>% 
  group_by(type, region) %>% 
  select(X1, year, Date, type, region, everything()) %>% 
  arrange(Date)
colName <- names(ds1)
colName[1] <- "ID"
names(ds1) <- colName
ds1$ID <- seq(nrow(ds1))
ds1$month <- month(ds1$Date)
ds1 <- ds1 %>% 
  select(ID, year, month, everything())
ds1New <- dummy_cols(ds1, select_columns = "type") %>% 
  select(ID, year, month, region, type_conventional, type_organic, 
         everything(), -type)
ds1New$other_PLU <- ds1New$`Total Volume` - ds1New$`4046` - ds1New$`4225` - ds1New$`4770`
ds1New <- ds1New %>% 
  select(1:3, Date, everything())
uniqueRegion <- unique(ds1New$region)
uniqueRegion <- as.data.frame(uniqueRegion)
uniqueRegion$Area <- NA
uniqueRegion$Area[1] <- "NewEngland"
uniqueRegion$Area[2] <- "Southeast"
uniqueRegion$Area[3] <- "Mideast"
uniqueRegion$Area[4] <- "RockyMountain"
uniqueRegion$Area[5] <- "NewEngland"
uniqueRegion$Area[6] <- "Mideast"
uniqueRegion$Area[7] <- "FarWest"
uniqueRegion$Area[8] <- "Southeast"
uniqueRegion$Area[9] <- "GreatLakes"
uniqueRegion$Area[10] <- "GreatLakes"
uniqueRegion$Area[11] <- "GreatLakes"
uniqueRegion$Area[12] <- "Southwest"
uniqueRegion$Area[13] <- "RockyMountain"
uniqueRegion$Area[14] <- "GreatLakes"
uniqueRegion$Area[15] <- "GreatLakes"
uniqueRegion$Area[16] <- "GreatLakes"
uniqueRegion$Area[17] <- "Mideast"
uniqueRegion$Area[18] <- "NewEngland"
uniqueRegion$Area[19] <- "Southeast"
uniqueRegion$Area[20] <- "GreatLakes"
uniqueRegion$Area[21] <- "Southeast"
uniqueRegion$Area[22] <- "FarWest"
uniqueRegion$Area[23] <- "FarWest"
uniqueRegion$Area[24] <- "Southeast"
uniqueRegion$Area[25] <- "Southeast"
uniqueRegion$Area[26] <- "Southeast"
uniqueRegion$Area[27] <- "Southeast"
uniqueRegion$Area[28] <- "Southeast"
uniqueRegion$Area[29] <- "Mideast"
uniqueRegion$Area[30] <- "NewEngland"
uniqueRegion$Area[31] <- "NewEngland"
uniqueRegion$Area[32] <- "Southeast"
uniqueRegion$Area[33] <- "Mideast"
uniqueRegion$Area[34] <- "Southwest"
uniqueRegion$Area[35] <- "Mideast"
uniqueRegion$Area[36] <- "Plains"
uniqueRegion$Area[37] <- "FarWest"
uniqueRegion$Area[38] <- "Southeast"
uniqueRegion$Area[39] <- "Southeast"
uniqueRegion$Area[40] <- "Southeast"
uniqueRegion$Area[41] <- "FarWest"
uniqueRegion$Area[42] <- "FarWest"
uniqueRegion$Area[43] <- "FarWest"
uniqueRegion$Area[44] <- "FarWest"
uniqueRegion$Area[45] <- "Southeast"
uniqueRegion$Area[46] <- "Southeast"
uniqueRegion$Area[47] <- "Southeast"
uniqueRegion$Area[48] <- "FarWest"
uniqueRegion$Area[49] <- "Plains"
uniqueRegion$Area[50] <- "Mideast"
uniqueRegion$Area[51] <- "Southeast"
uniqueRegion$Area[52] <- "TotalUS"
uniqueRegion$Area[53] <- "FarWest"
uniqueRegion$Area[54] <- "Southwest"
names(uniqueRegion)[1] <- "region"
avo <- ds1New %>% 
  left_join(uniqueRegion, by = "region") %>% 
  select(1:5, Area, everything())
avo <- dummy_cols(avo, select_columns = "Area")
names(avo)[10] <- "TotalVolume"
names(avo)[14] <- "TotalBags"
names(avo)[15] <- "SmallBags"
names(avo)[16] <- "LargeBags"
names(avo)[17] <- "XLargeBags"
names(avo)[11] <- "PLU4046"
names(avo)[12] <- "PLU4225"
names(avo)[13] <- "PLU4770"
set.seed(1234)
avo_train <- avo %>% filter(as.Date(Date) < "2017-03-01")
avo_train %>%
  filter(year == 2017, month == 2)
avo_test <- avo %>% filter(as.Date(Date) >= "2017-03-01")
avo_test %>%
  filter(year == 2018, month == 3)

Prepare the model matrices with the two new lag predictors.

f2 <- as.formula(AveragePrice ~ month + type_conventional + type_organic + 
                   TotalVolume + PLU4046 + PLU4770 + PLU4225 + 
                   SmallBags + LargeBags + XLargeBags + 
                   Area_NewEngland + Area_Southeast + Area_Mideast + 
                   Area_RockyMountain + Area_FarWest + Area_GreatLakes + 
                   Area_Southwest + Area_Plains + Area_TotalUS + 
                   NewPrice + NewPrice2)
x1_train <- model.matrix(f2,avo_train)[,-1]
y1_train <- avo_train$AveragePrice
x1_test <- model.matrix(f2, avo_test)[,-1]
y1_test <- avo_test$AveragePrice
date_test <- avo_test$Date
x1_avo <- model.matrix(f2, avo)[,-1]

Run lasso

lmodel <- glmnet(x1_train, y1_train, alpha = 1, nlambda = 100)
lmodel$lambda
 [1] 0.2619920153 0.2387173668 0.2175103739 0.1981873517
 [5] 0.1805809337 0.1645386212 0.1499214636 0.1366028540
 [9] 0.1244674330 0.1134100893 0.1033350495 0.0941550485
[13] 0.0857905735 0.0781691754 0.0712248413 0.0648974227
[17] 0.0591321146 0.0538789805 0.0490925204 0.0447312762
[21] 0.0407574729 0.0371366913 0.0338375700 0.0308315336
[25] 0.0280925451 0.0255968808 0.0233229245 0.0212509802
[29] 0.0193631016 0.0176429371 0.0160755872 0.0146474763
[33] 0.0133462349 0.0121605922 0.0110802787 0.0100959373
[37] 0.0091990420 0.0083818245 0.0076372063 0.0069587379
[41] 0.0063405428 0.0057772665 0.0052640301 0.0047963881
[45] 0.0043702902 0.0039820457 0.0036282917 0.0033059643
[49] 0.0030122715 0.0027446696 0.0025008407 0.0022786730
[53] 0.0020762419 0.0018917943 0.0017237325 0.0015706009
[57] 0.0014310730 0.0013039404 0.0011881020 0.0010825543
[61] 0.0009863831 0.0008987556 0.0008189126 0.0007461627
[65] 0.0006798756 0.0006194773 0.0005644447 0.0005143010
[69] 0.0004686119 0.0004269817 0.0003890498 0.0003544877
[73] 0.0003229960 0.0002943019 0.0002681570 0.0002443346
[77] 0.0002226286

Predict response

y1_train_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_train)
y1_test_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_test)
length(y1_test_hat)
[1] 465542
mse_train = vector()
mse_test = vector()
for (i in 1:length(lmodel$lambda)) {
  mse_train[i] <- mean((y1_train - y1_train_hat)[,i]^2)
  mse_test[i] <- mean((y1_test - y1_test_hat)[,i]^2)
}
mse_train
 [1] 0.15012060 0.13846677 0.12879155 0.12075901 0.11409026
 [6] 0.10855374 0.10395723 0.10014113 0.09697293 0.09434263
[11] 0.09199804 0.08978030 0.08792382 0.08620959 0.08478002
[16] 0.08255566 0.08003344 0.07794176 0.07620517 0.07449818
[21] 0.07296746 0.07155953 0.07038881 0.06941684 0.06860990
[26] 0.06794111 0.06734291 0.06678767 0.06631274 0.06591720
[31] 0.06555875 0.06524122 0.06497820 0.06475930 0.06457757
[36] 0.06442670 0.06429494 0.06416733 0.06405669 0.06396450
[41] 0.06388794 0.06382482 0.06377207 0.06372820 0.06369136
[46] 0.06365996 0.06363365 0.06360427 0.06357663 0.06355334
[51] 0.06351938 0.06349116 0.06345885 0.06342691 0.06340217
[56] 0.06338119 0.06336399 0.06334973 0.06333763 0.06332777
[61] 0.06331962 0.06331264 0.06330724 0.06330286 0.06329890
[66] 0.06329546 0.06329288 0.06328888 0.06328599 0.06328380
[71] 0.06328181 0.06328004 0.06327853 0.06327746 0.06327648
[76] 0.06327561 0.06327503
mse_test
 [1] 0.1963519 0.1866676 0.1787868 0.1723894 0.1672105 0.1630314
 [7] 0.1596718 0.1569827 0.1548414 0.1531467 0.1511442 0.1483245
[13] 0.1459111 0.1431670 0.1408326 0.1385649 0.1364580 0.1347493
[19] 0.1333634 0.1320312 0.1308327 0.1296515 0.1286929 0.1279137
[25] 0.1272820 0.1267663 0.1262994 0.1259077 0.1256045 0.1253750
[31] 0.1251657 0.1249740 0.1248329 0.1247305 0.1246567 0.1246056
[37] 0.1245521 0.1244743 0.1243997 0.1243466 0.1243063 0.1242713
[43] 0.1242488 0.1242333 0.1242222 0.1242172 0.1242134 0.1241673
[49] 0.1240975 0.1240429 0.1240655 0.1240820 0.1241052 0.1240991
[55] 0.1240995 0.1241009 0.1241053 0.1241115 0.1241174 0.1241246
[61] 0.1241325 0.1241400 0.1241477 0.1241602 0.1241680 0.1241748
[67] 0.1241816 0.1241928 0.1242047 0.1242185 0.1242335 0.1242481
[73] 0.1242621 0.1242733 0.1242854 0.1242967 0.1243042
min(mse_test)
[1] 0.1240429
lambda_min_mse_train<- lmodel$lambda[which.min(mse_train)]
lambda_min_mse_test <-lmodel$lambda[which.min(mse_test)]
lambda_min_mse_train
[1] 0.0002226286
lambda_min_mse_test
[1] 0.00274467

Use the cross-validation function to find the best lambda

set.seed(1)
cv.out = cv.glmnet(x1_train, y1_train, alpha = 1)

Plot the cross-validation error against log(lambda)

plot(cv.out,
     xlab = "Log(lambda)")

Check the best lambda

bestlam = cv.out$lambda.min
bestlam ## the best lambda for the training dataset is the same as lambda_min_mse_train
[1] 0.0002226286

Create a new formula with a reduced set of predictors (eliminating the uncorrelated ones)

f3 <- as.formula(AveragePrice ~ month + type_conventional + type_organic + 
                   PLU4046 + PLU4770 + PLU4225 + LargeBags + XLargeBags + 
                   Area_NewEngland + Area_Southeast + Area_Mideast + 
                   Area_RockyMountain + Area_GreatLakes + 
                   Area_Southwest + Area_Plains)
x2_train <- model.matrix(f3,avo_train)[,-1]
x2_test <- model.matrix(f3, avo_test)[,-1]

Run lasso model again with new predictors

lmodel2 <- glmnet(x2_train, y1_train, alpha = 1, nlambda = 100)

Predict response with new predictors

y2_train_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_train)
y2_test_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_test)

Compute the MSE again with the new predictors. The results show that the training MSE increases when we eliminate the uncorrelated predictors, while the test MSE stays roughly the same.

mse_train1 = vector()
mse_test1 = vector()
for (i in 1:length(lmodel2$lambda)) {
  mse_train1[i] <- mean((y1_train - y2_train_hat)[,i]^2)
  mse_test1[i] <- mean((y1_test - y2_test_hat)[,i]^2)
}
mse_train1
 [1] 0.15012060 0.13846677 0.12879155 0.12075901 0.11409026
 [6] 0.10855374 0.10395723 0.10014113 0.09697293 0.09434263
[11] 0.09215892 0.09034596 0.08884081 0.08759121 0.08655376
[16] 0.08431018 0.08196395 0.07990953 0.07779947 0.07604766
[21] 0.07449047 0.07309754 0.07194110 0.07098100 0.07018392
[26] 0.06952091 0.06887432 0.06830444 0.06782869 0.06743077
[31] 0.06707446 0.06675856 0.06649647 0.06627891 0.06609828
[36] 0.06594833 0.06579625 0.06565543 0.06553841 0.06544197
[41] 0.06536123 0.06529218 0.06523639 0.06518931 0.06515058
[46] 0.06511806 0.06509062 0.06505655 0.06502858 0.06500546
[51] 0.06498445 0.06495657 0.06492196 0.06489168 0.06486665
[56] 0.06484615 0.06482917 0.06481507 0.06480312 0.06479365
[61] 0.06478540 0.06477871 0.06477295 0.06476919 0.06476505
[66] 0.06476156 0.06475901 0.06475670 0.06475470 0.06475300
[71] 0.06475183 0.06475049 0.06474963 0.06474884
mse_test1
 [1] 0.1963519 0.1866676 0.1787868 0.1723894 0.1672105 0.1630314
 [7] 0.1596718 0.1569827 0.1548414 0.1531467 0.1518155 0.1507794
[13] 0.1499820 0.1493773 0.1489275 0.1480109 0.1471125 0.1463090
[19] 0.1450664 0.1441028 0.1432008 0.1423409 0.1416688 0.1411489
[25] 0.1407521 0.1404541 0.1401995 0.1399675 0.1398037 0.1397067
[31] 0.1396211 0.1395406 0.1394979 0.1394826 0.1394880 0.1395091
[37] 0.1394704 0.1394028 0.1393537 0.1393220 0.1392987 0.1392808
[43] 0.1392813 0.1392815 0.1392885 0.1392955 0.1392985 0.1392217
[49] 0.1391593 0.1391064 0.1390709 0.1391076 0.1391526 0.1391682
[55] 0.1391818 0.1391983 0.1392159 0.1392337 0.1392502 0.1392675
[61] 0.1392842 0.1392995 0.1393142 0.1393305 0.1393470 0.1393587
[67] 0.1393693 0.1393810 0.1393914 0.1394004 0.1394073 0.1394171
[73] 0.1394228 0.1394295
min(mse_train1)
[1] 0.06474884
min(mse_test1)
[1] 0.1390709
lambda_min_mse_train1<- lmodel2$lambda[which.min(mse_train1)]
lambda_min_mse_test1 <-lmodel2$lambda[which.min(mse_test1)]
lambda_min_mse_train1
[1] 0.0002943019
lambda_min_mse_test1
[1] 0.002500841

Check the coefficients of the f2 and f3 models

f2coef<-coef(lmodel, s = lambda_min_mse_test)
f3coef<-coef(lmodel2, s = lambda_min_mse_test1)
f2coef 
22 x 1 sparse Matrix of class "dgCMatrix"
                               1
(Intercept)         1.240517e+00
month               1.193989e-02
type_conventional  -4.315907e-01
type_organic        1.037824e-11
TotalVolume         .           
PLU4046            -9.803572e-09
PLU4770             1.862104e-10
PLU4225             .           
SmallBags           .           
LargeBags          -3.217985e-08
XLargeBags          1.559099e-07
Area_NewEngland     2.244133e-01
Area_Southeast     -2.055361e-02
Area_Mideast        2.041316e-01
Area_RockyMountain -1.001925e-01
Area_FarWest        2.251057e-02
Area_GreatLakes    -6.889180e-03
Area_Southwest     -1.465750e-01
Area_Plains         3.874623e-02
Area_TotalUS        .           
NewPrice            9.836736e-02
NewPrice2           6.594030e-02
f3coef
16 x 1 sparse Matrix of class "dgCMatrix"
                               1
(Intercept)         1.512389e+00
month               1.454115e-02
type_conventional  -5.109947e-01
type_organic        1.145366e-11
PLU4046            -9.401350e-09
PLU4770             1.431715e-09
PLU4225             .           
LargeBags          -4.939522e-08
XLargeBags          2.314856e-07
Area_NewEngland     2.107071e-01
Area_Southeast     -4.740335e-02
Area_Mideast        1.790446e-01
Area_RockyMountain -1.423959e-01
Area_GreatLakes    -4.033916e-02
Area_Southwest     -1.797424e-01
Area_Plains         9.054443e-03

Since the second model has a higher training MSE, the ideal model is still the first one, “lmodel”, so we use “lmodel” to run the prediction.

y1_avo_min_lambda_hat <- predict(lmodel, s = lambda_min_mse_test, newx = x1_avo)
class(y1_avo_min_lambda_hat)
[1] "matrix"

Aggregate data into one dataframe for the first model prediction

df2<-avo %>% 
  select(Date, AveragePrice)
df3<-cbind(df2, y1_avo_min_lambda_hat)
colnames(df3)
[1] "Date"         "AveragePrice" "1"           
names(df3)[3]<-"AveragePrice_hat"

Plot the actual average price and the predictive average price

head(df3)
        Date AveragePrice AveragePrice_hat
1 2015-01-04         1.22        1.0452352
2 2015-01-04         1.00        0.9157808
3 2015-01-04         1.08        1.2031624
4 2015-01-04         1.01        0.8923604
5 2015-01-04         1.02        1.2157677
6 2015-01-04         1.40        1.1919005
class(df3$Date)
[1] "Date"
 df3 %>% 
  group_by(Date) %>%
  summarise(meanpriced = mean(AveragePrice),meanpre = mean(AveragePrice_hat))%>%
  ggplot()+
  geom_line(mapping = aes(x=Date,
                           y=meanpriced), col = "#356211")+
  geom_line(mapping = aes(x=Date, y= meanpre), col = "#cda989")+
  geom_vline(xintercept=as.numeric(as.Date("2017-03-01")), col = "#AA471F", linetype = "dashed", size = 1)+
   labs(title = "Lasso with new variables",
        y = "Mean Price") +
   theme_classic()

We can see from this graph that, compared with the regular lasso we ran before, this model gives a much better prediction on our dataset: it captures not only the seasonal change in avocado prices but also some variation within each year. However, although this model has the lowest test MSE so far, visually the prediction is still not as good as some of our earlier models, so we need to work further with this time-series dataset to get more accurate predictions.

Because we are working with a time-series dataset, it is reasonable to assume that the previous day's price plays a big part in predicting the next day's price. Therefore, we tried to write code that predicts one day's price and uses that prediction as a predictor for the next day's price.
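
The walk-forward idea can be sketched on synthetic data (this toy illustration is our addition, not the project data; `arima.sim` just generates an autocorrelated series): fit on the history, predict one step ahead, then feed that prediction in as the lag feature for the following step.

```r
# Toy walk-forward sketch with a lag-1 feature (not the avocado data).
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.8), n = 120))
lag1 <- c(NA, head(y, -1))            # previous observation as a feature
d <- na.omit(data.frame(y = y, lag1 = lag1))
train_n <- 100
preds <- numeric(0)
next_lag <- d$y[train_n]              # last observed value before the test window
for (i in (train_n + 1):nrow(d)) {
  fit <- lm(y ~ lag1, data = d[1:(i - 1), ])
  preds <- c(preds, predict(fit, data.frame(lag1 = next_lag)))
  next_lag <- tail(preds, 1)          # feed the prediction, not the truth
}
length(preds)                         # one prediction per test row
```

The key step is the last line of the loop body: the model never sees the true test prices, only its own previous prediction.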

Let’s make a new data frame with the 1-day lagged price as an additional variable, and set it to zero for the test data (since in real life we would not know the previous day's price):

avo_ts <- avo %>% 
          mutate(lag1_y = lag(AveragePrice))
dim(avo_ts)
[1] 18249    28
avo_ts <- drop_na(avo_ts)
dim(avo_ts)
[1] 18248    28
avo_train_ts <- avo_ts[1:12202, ]
avo_test_ts <- avo_ts[12203:18248, ]
dim(avo_test_ts)
[1] 6046   28
avo_test_ts[2:6046,"lag1_y"] <- 0

Run a loop that predicts each day using the previous day's prediction:

f_ts <- as.formula(AveragePrice ~ lag1_y + month + type_conventional + type_organic + 
                           PLU4046 + PLU4770 + PLU4225 + LargeBags + XLargeBags + 
                           Area_NewEngland + Area_Southeast + Area_Mideast + 
                           Area_RockyMountain + Area_GreatLakes + 
                           Area_Southwest + Area_Plains)
y_hat_ts_test_f <- avo_test_ts[1,"lag1_y"]
avo_ts1 <- avo_ts
lambda_min_test <- 0.001570601 #the min lambda from lasso model above
check_ts <- vector()
for (i in 1:6045) {
  avo_train_ts <- avo_ts1[1:(12203+i), ]
  avo_test_ts <- avo_ts1[(12204+i):18248, ]
  
  x_ts_train <- model.matrix(f_ts,avo_train_ts)[,-1]
  x_ts_test <- model.matrix(f_ts, avo_test_ts)[,-1]
  
  y_ts_train <- avo_train_ts$AveragePrice
  y_ts_test <- avo_test_ts$AveragePrice
  
  lmodel_ts <- glmnet(x_ts_train, y_ts_train, alpha = 1, nlambda = 100)
  
  y_ts_train_hat <- predict(lmodel_ts, s = lambda_min_test, newx = x_ts_train)
  y_ts_test_hat <- predict(lmodel_ts, s = lambda_min_test, newx = x_ts_test)
  y_hat_ts_test_f <- c(y_hat_ts_test_f, y_ts_test_hat[1])
  (length(y_hat_ts_test_f))
  avo_ts1[(12204+i), "lag1_y"] <- y_ts_test_hat[1]
  
}
Error in cbind2(1, newx) %*% nbeta : 
  Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90

The loop returns an error that we, unfortunately, could not resolve (in part because the loop takes a long time to run).
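
One plausible (unverified) diagnosis is that the loop's upper bound overruns the data. A minimal arithmetic check, assuming 18248 rows and the test slice `avo_ts1[(12204 + i):18248, ]` used above:

```r
# Hypothetical off-by-one check; we did not rerun the overnight loop.
n_rows <- 18248   # rows in avo_ts1 after drop_na()
start  <- 12204   # the test slice inside the loop begins at start + i
# The slice (start + i):n_rows is valid only while start + i <= n_rows.
max_valid_i <- n_rows - start   # largest i with at least one test row
max_valid_i
# At i = 6045 the slice would begin at 18249 > 18248, indexing a
# non-existent row, which could produce the dimension error above.
```

If that is the cause, stopping the loop at `max_valid_i` (and predicting the final rows in one last step) would avoid the error.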

However, it still managed to gather predictions based on the previous day's prediction as a variable, as we can see here:

length(y_hat_ts_test_f)
[1] 6044
y_hat_ts_test_f <- unlist(y_hat_ts_test_f)

The vector is supposed to contain 6046 numbers, but, perhaps due to the error, it lost 2 of them somewhere in the loop. We assume that 2 out of 6046 will not make a significant difference, so in order to calculate the MSE properly we substitute the 2 missing numbers with the average of the known 6044:

y_hat_ts_test_f <- c(y_hat_ts_test_f, mean(y_hat_ts_test_f), mean(y_hat_ts_test_f))
length(y_hat_ts_test_f)
[1] 6046

We also need to readjust the test data because the loop modified it:

avo_test_ts <- avo_ts[12203:18248,]
y_ts_test <- avo_test_ts$AveragePrice

We now can calculate MSE:

mse_ts_test_f <- mean((y_ts_test - y_hat_ts_test_f)^2)
print(mse_ts_test_f)
[1] 0.1113422

The MSE shown above is the lowest among all the models we have run. This may indicate that, for a time-series dataset, it is better to use the previous day's prediction as a variable when predicting the next day. However, it is quite inconvenient and time-consuming to adjust the model's variables and predict each row separately.

Let’s see if plot can show the difference.

Need to select and prepare data for the plot first:

avo %>% 
  select(Date, AveragePrice) -> plot_ts_full
plot_ts_full <- plot_ts_full[2:nrow(plot_ts_full),]
y_hat_train_ts <-  y_ts_train_hat[1:12202,]
y_hat_ts_full <- c(y_hat_train_ts, y_hat_ts_test_f )
plot_ts <- cbind(plot_ts_full, y_hat_ts_full)

Plot:

plot_ts %>% 
  group_by(Date) %>% 
  summarise(mean_y = mean(AveragePrice),
            mean_yhat = mean(y_hat_ts_full)) %>% 
  ggplot() +
  geom_line(aes(x = Date, y = mean_y), col = "#356211") +
  geom_line(aes(x = Date, y = mean_yhat), col = "#cda989") -> ts_graph
 
ts_graph + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                   linetype = "dashed", size = 1,
                   color = "#AA471F") + 
  labs(title = "Lasso with Loop",
       y = "Mean Price")

The graph does not show much of a difference; however, we know this model is better because it has the lowest test MSE.


Result


Conclusions

---
title: "BA810 Team 7: Avocado Price Modeling"
author: Ziqin Ma, Qiaoling Huang, Shihan Li, Chenran Peng, Elmira Ushirova
date: October 16, 2019
output: html_notebook
---

## Notebook Setting ##

Load the required packages.
```{r}
options(stringsAsFactors = FALSE)

library(tidyverse)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(fastDummies)
library(scales)
library(glmnet)
library(zoo)
library(ISLR)
library(leaps)
library(rpart)
library(rpart.plot)
library(tree)
library(randomForest)
library(gbm)

# Set the ggtheme beforehand for all plots
theme_set(theme_clean())
```

----------

## Introduction ##

This notebook was created by Cohort A Team 7 for the BA 810 team project. It covers our business framing of an existing dataset and our attempts at training machine learning models.

### Business Problem ###

Our team is planning to open a store called "Everything Avocado", selling avocado food products such as avocado toast, tortilla chips with guacamole, and avocado smoothies. While preparing to open, we want to set a budget for buying raw avocados as inventory, so we would like to forecast future avocado prices.

Fortunately, all of the team members are taking a machine learning course together, so we decided to develop a model to **predict avocado prices** using an existing data source.

### Audience of our model ###

We expect that our optimal model could be applied to other raw-product retail, especially fresh products (fruits and vegetables). Ideally, the price of any product of this type could be predicted, provided the prices are influenced by similar predictors. In a business setting, anyone who wants a sense of future product prices (retailers, customers, market analysts, etc.) would hold a strong interest in our model.

--------------

## Dataset Debrief ##

* ### Data Source ###

The dataset we use is the `Avocado Price` dataset from [Kaggle](https://www.kaggle.com/neuromusic/avocado-prices). It is a record of historical avocado prices and sales volume in multiple US markets. 

```{r}
ds <- read_csv("avocado.csv")  
glimpse(ds)
```
Here is a quick view of the columns and data types. There are 14 columns in this dataset, and we will use `AveragePrice` as our target. `X1` gives an ID for each observation, and we will replace it with a sequential row number. Most of the columns are `<dbl>`; only 2 columns, `type` and `region`, are character vectors, and we will convert them into dummy variables to train our models. The time span is from January 4, 2015 to March 25, 2018. 

* ### Descriptive Analysis ###

Before going into the modeling, we made a plot for the average price of avocados grouped by date, and we observed that average price usually increases during the second half of the year.
```{r}
ds %>% 
group_by(Date) %>%
  summarise(meanpriced = mean(AveragePrice)) %>%
  ggplot(aes(x = Date, y = meanpriced)) +
  geom_point(col = "#4a7337") +
  geom_smooth(col = "#d9cd65", se = FALSE) + 
  labs(y = "Average price",
       title = "Average price by date") + 
  theme_bw()
```

We also plotted the total quantity sold (volume) by date. We noticed that at the beginning of each year there is one day on which the volume sold is extremely high. We checked the calendar and found that these days were all Super Bowl game days!
```{r}
ds %>%
  group_by(Date) %>%
  summarise(totalvold = sum(`Total Volume`)) %>%
  ggplot(aes(x = Date, y = totalvold)) +
  geom_point(col = "#4a7337") +
  geom_smooth(col = "#d9cd65", se = FALSE) + 
  labs(y = "Total volume sold",
       title = "Total volume sold by date") + 
  theme_bw() 
```


----------

## Data Preparation ##

In the following code chunk, we took multiple steps to arrange the columns.
```{r}
# Reorder the columns and sort the rows by date
ds <- ds %>% 
  group_by(type, region) %>% 
  select(X1, year, Date, type, region, everything()) %>% 
  arrange(Date)

# Change `X1` to `ID`
colName <- names(ds)
colName[1] <- "ID"
names(ds) <- colName

# Assign distinct number id to each observation in `ID` column
ds$ID <- seq(nrow(ds))

# Extract month from `Date` column
ds$month <- month(ds$Date)
ds <- ds %>% 
  select(ID, year, month, everything())

```


There are only 2 types of avocados recorded, conventional and organic. We decided to separate `type` column into `type_conventional` and `type_organic`.
```{r}
dsNew <- dummy_cols(ds, select_columns = "type") %>% 
  select(ID, year, month, region, type_conventional, type_organic, 
         everything(), -type)

```

The Product Look Up (PLU) codes are used by grocery retailers to differentiate product inventories. This dataset includes 3 avocado PLUs, and we added a new column `other_PLU` for the volume sold in products without a PLU. The number is calculated by subtracting the PLU volumes from the total volume.
```{r}
dsNew$other_PLU <- dsNew$`Total Volume` - dsNew$`4046` - dsNew$`4225` - dsNew$`4770`

# Reorder the columns
dsNew <- dsNew %>% 
  select(1:3, Date, everything())
```
We initially included `other_PLU` as a predictor in our models, but then we found that `other_PLU` is equal to `Total Bags`. This equivalence resolved our confusion about the bag variables: ?????
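
This equivalence can be spot-checked numerically. A minimal sketch, using two rows transcribed from the raw Kaggle file as a stand-in for `dsNew`:

```r
# Toy spot-check: volume not attributed to any PLU code should equal
# the bagged volume (values copied from the first two rows of avocado.csv).
toy <- data.frame(
  TotalVolume = c(64236.62, 54876.98),
  PLU4046     = c(1036.74, 674.28),
  PLU4225     = c(54454.85, 44638.81),
  PLU4770     = c(48.16, 58.33),
  TotalBags   = c(8696.87, 9505.56)
)
toy$other_PLU <- toy$TotalVolume - toy$PLU4046 - toy$PLU4225 - toy$PLU4770
isTRUE(all.equal(toy$other_PLU, toy$TotalBags))  # compare up to FP tolerance
```

Running the same `all.equal` comparison on the full `dsNew` columns is how the equivalence can be confirmed on the whole dataset.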


Categorize `region` into US Areas
```{r}
uniqueRegion <- unique(dsNew$region)
uniqueRegion <- as.data.frame(uniqueRegion)
uniqueRegion$Area <- c(
  "NewEngland", "Southeast", "Mideast", "RockyMountain", "NewEngland",
  "Mideast", "FarWest", "Southeast", "GreatLakes", "GreatLakes",
  "GreatLakes", "Southwest", "RockyMountain", "GreatLakes", "GreatLakes",
  "GreatLakes", "Mideast", "NewEngland", "Southeast", "GreatLakes",
  "Southeast", "FarWest", "FarWest", "Southeast", "Southeast",
  "Southeast", "Southeast", "Southeast", "Mideast", "NewEngland",
  "NewEngland", "Southeast", "Mideast", "Southwest", "Mideast",
  "Plains", "FarWest", "Southeast", "Southeast", "Southeast",
  "FarWest", "FarWest", "FarWest", "FarWest", "Southeast",
  "Southeast", "Southeast", "FarWest", "Plains", "Mideast",
  "Southeast", "TotalUS", "FarWest", "Southwest"
)
names(uniqueRegion)[1] <- "region"

avo <- dsNew %>% 
  left_join(uniqueRegion, by = "region") %>% 
  select(1:5, Area, everything())


avo <- dummy_cols(avo, select_columns = "Area")

```


Rename Column Names
```{r}
colnames(avo)

names(avo)[10] <- "TotalVolume"
names(avo)[14] <- "TotalBags"
names(avo)[15] <- "SmallBags"
names(avo)[16] <- "LargeBags"
names(avo)[17] <- "XLargeBags"
names(avo)[11] <- "PLU4046"
names(avo)[12] <- "PLU4225"
names(avo)[13] <- "PLU4770"
```

----------

## Create Train and Test Datasets ##

```{r}
set.seed(1234)
avo_train <- avo %>% filter(as.Date(Date) < "2017-03-01")
avo_train %>%
  filter(year == 2017, month == 2)
avo_test <- avo %>% filter(as.Date(Date) >= "2017-03-01")
avo_test %>%
  filter(year == 2018, month == 3)


```


Base model
```{r}
f1 <- as.formula(AveragePrice ~ month+type_conventional + type_organic + TotalVolume + 
                   PLU4046 + PLU4770 + PLU4225 + SmallBags + LargeBags + XLargeBags + 
                   + Area_NewEngland + Area_Southeast
                 + Area_Mideast + Area_RockyMountain
                 + Area_FarWest + Area_GreatLakes
                 + Area_Southwest + Area_Plains + Area_TotalUS)

x1_train <- model.matrix(f1,avo_train)[,-1]
y1_train <- avo_train$AveragePrice
x1_test <- model.matrix(f1,avo_test)[,-1]
y1_test <- avo_test$AveragePrice
x1_avo <- model.matrix(f1, avo)[,-1]
```

----------

## Modeling ##

* ### Linear Regression ###

```{r}
fit.lm <- lm(f1, avo_train)
fit.lm

y.train <- avo_train$AveragePrice
y.test <- avo_test$AveragePrice
yhat.train.lm <- predict(fit.lm)
mse.train.lm <- mean((y.train - yhat.train.lm)^2)
yhat.test.lm <- predict(fit.lm, avo_test)
mse.test.lm <- mean((y.test - yhat.test.lm)^2)
mse.train.lm
mse.test.lm

coef(fit.lm)
plot(fit.lm)
```

* ### Forward Selection ###

```{r}
avo1 <- avo
avo1$ID <- NULL
avo1$year<-NULL
avo1$Date<-NULL
avo1$region<-NULL
avo1$Area<-NULL
avo1$AveragePrice<-NULL
xnames <- colnames(avo1)
xnames <- xnames[!xnames %in% c("type_conventional", "type_organic", "TotalVolume", "PLU4046", "PLU4770", "PLU4225", "SmallBags", "LargeBags", "XLargeBags", "Area_NewEngland", "Area_Southeast", "Area_Mideast", "Area_RockyMountain", "Area_FarWest", "Area_GreatLakes", "Area_Southwest", "Area_Plains", "Area_TotalUS")]
fit_fw <- lm(AveragePrice ~ 1, data = avo_train)
yhat_train <- predict(fit_fw, avo_train)
mse_train <- mean((avo_train$AveragePrice - yhat_train) ^ 2)
yhat_test <- predict(fit_fw, avo_test)
mse_test <- mean((avo_test$AveragePrice - yhat_test) ^ 2)

log_fw <- tibble(
  xname = "intercept",
  model = deparse(fit_fw$call),
  mse_train = mse_train,
  mse_test = mse_test
)

best_mse_train <- NA
best_mse_test <- NA
best_fit_fw <- NA
best_xname <- NA
for (xname in xnames) {
  # fit a model with this single candidate predictor
  fit_fw_tmp <- update(fit_fw, as.formula(paste0(" ~ ", xname)))
  # compute MSE train
  yhat_train_tmp <- predict(fit_fw_tmp, avo_train)
  mse_train_tmp <- mean((avo_train$AveragePrice - yhat_train_tmp) ^ 2)
  # compute MSE test
  yhat_test_tmp <- predict(fit_fw_tmp, avo_test)
  mse_test_tmp <- mean((avo_test$AveragePrice - yhat_test_tmp) ^ 2)
  # if this is the first predictor to be examined,
  # or if it yields a lower MSE than the current best,
  # store it as the current best predictor
  if (is.na(best_mse_test) | mse_test_tmp < best_mse_test) {
    best_xname <- xname
    best_fit_fw <- fit_fw_tmp
    best_mse_train <- mse_train_tmp
    best_mse_test <- mse_test_tmp
  }
}
best_mse_train
best_mse_test
```

```{r}
log_fw <- log_fw %>% add_row(
  xname = best_xname,
  model = paste0(deparse(best_fit_fw$call), collapse = ""),
  mse_train = best_mse_train,
  mse_test = best_mse_test
)

log_fw

# here is a complete solution
xnames<- colnames(avo1)
xnames <- xnames[!xnames %in% c("type_conventional", "type_organic", "TotalVolume", "PLU4046", "PLU4770", "PLU4225", "SmallBags", "LargeBags", "XLargeBags", "Area_NewEngland", "Area_Southeast", "Area_Mideast", "Area_RockyMountain", "Area_FarWest", "Area_GreatLakes", "Area_Southwest", "Area_Plains", "Area_TotalUS")]
fit_fw <- lm(AveragePrice ~ 1, data = avo_train)
yhat_train <- predict(fit_fw, avo_train)
yhat_test <- predict(fit_fw, avo_test)
mse_train <- mean((avo_train$AveragePrice - yhat_train)^2)
mse_test <- mean((avo_test$AveragePrice - yhat_test)^2)
xname <- "intercept"

log_fw <- tibble(
  xname = xname,
  model = paste0(deparse(fit_fw$call), collapse = ""),
  mse_train = mse_train,
  mse_test = mse_test
)
while (length(xnames) > 0) {
  best_mse_train <- NA
  best_mse_test <- NA
  best_fit_fw <- NA
  best_xname <- NA
  # select the next best predictor
  for (xname in xnames) {
    # add one candidate predictor to the current model
    fit_fw_tmp <- update(fit_fw, as.formula(paste0(". ~ . + ", xname)))
    # compute MSE train
    yhat_train_tmp <- predict(fit_fw_tmp, avo_train)
    mse_train_tmp <- mean((avo_train$AveragePrice - yhat_train_tmp) ^ 2)
    # compute MSE test
    yhat_test_tmp <- predict(fit_fw_tmp, avo_test)
    mse_test_tmp <- mean((avo_test$AveragePrice - yhat_test_tmp) ^ 2)
    # if this is the first predictor to be examined,
    # or if it yields a lower MSE than the current best,
    # store it as the current best predictor
    if (is.na(best_mse_test) | mse_test_tmp < best_mse_test) {
      best_xname <- xname
      best_fit_fw <- fit_fw_tmp
      best_mse_train <- mse_train_tmp
      best_mse_test <- mse_test_tmp
    }
  }
  log_fw <- log_fw %>% add_row(
    xname = best_xname,
    model = paste0(deparse(best_fit_fw$call), collapse = ""),
    mse_train = best_mse_train,
    mse_test = best_mse_test
  )
  # adopt the best model for the next iteration
  fit_fw <- best_fit_fw
  # remove the current best predictor from the list of predictors
  xnames <- xnames[xnames != best_xname]
}
```

```{r}
ggplot(log_fw, aes(seq_along(xname), mse_test)) +
  geom_point() +
  geom_line() +
  geom_point(aes(y = mse_train), color = "#AA471F") +
  geom_line(aes(y = mse_train), color = "#AA471F") +
  scale_x_continuous("Variables", labels = log_fw$xname, breaks = seq_along(log_fw$xname)) +
  scale_y_continuous("MSE test") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

```{r}
fw<- as.formula(AveragePrice ~ month + TotalBags + Area_TotalUS + Area_Plains + other_PLU)
fit.fw<-lm(fw, data = avo_train)
```

```{r}
## calculate the yhat price for the avo dataset
yhat_avo_fw <- predict(fit.fw, avo)
df1_fw<-avo %>% 
  select(Date, AveragePrice)
df2_fw<-cbind(df1_fw, yhat_avo_fw)
colnames(df2_fw)
names(df2_fw)[3]<-"AveragePrice_hat"
## Plot the actual average price and the predictive average price
plot1_fw <- df2_fw %>% 
  group_by(Date) %>% 
  summarize(
    MeanAvg=mean(AveragePrice),
    MeanAvg_hat=mean(AveragePrice_hat))%>% 
    ggplot()+
    geom_line(aes(Date, MeanAvg),color = "#356211")+
    geom_line(aes(Date, MeanAvg_hat), color = "#cda989")+
    theme_classic()

plot1_fw + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                   linetype = "dashed", size = 1,
                   color = "#AA471F")

```


* ### Backward Selection ###
Use the `f1` formula to regress average price on all predictors, in order to identify the best model containing a given number of predictors. 
```{r}
regfit.bwd = regsubsets(f1, data = avo_train, nvmax = 19, method = "backward")
summary(regfit.bwd)
```
Calculate the predictions and MSE for the train and test datasets; the test MSE is far too large for the model that includes all the predictors.
```{r}
sub_model<-lm(f1, data = avo_train)
yhat_train_stepwise <- predict(sub_model, avo_train)
MSE_train_stepwise <- mean((avo_train$AveragePrice - yhat_train_stepwise)^2)
MSE_train_stepwise
yhat_test_stepwise <- predict(sub_model, avo_test)
MSE_test_stepwise <- mean((avo_test$AveragePrice - yhat_test_stepwise)^2)
MSE_test_stepwise
```
Based on the summary, the best-fitting model contains every predictor except `Area_TotalUS` and `type_organic`, so we excluded these two predictors and created a new formula to train the model. 
```{r}
f1_1 <- as.formula(AveragePrice ~ month + type_conventional + TotalVolume + 
                      PLU4046 + PLU4770 + PLU4225 + SmallBags + LargeBags + XLargeBags + 
                      + Area_NewEngland + Area_Southeast
                    + Area_Mideast + Area_RockyMountain
                    + Area_FarWest + Area_GreatLakes
                    + Area_Southwest
                    + Area_Plains)
regfit.bwd1 = regsubsets(f1_1, data = avo_train, nvmax = 19, method = "backward")
summary(regfit.bwd1)

sub_model1<-lm(f1_1, data = avo_train)
yhat_train_stepwise1 <- predict(sub_model1, avo_train)
MSE_train_stepwise1 <- mean((avo_train$AveragePrice - yhat_train_stepwise1)^2)
MSE_train_stepwise1
yhat_test_stepwise1 <- predict(sub_model1, avo_test)
MSE_test_stepwise1 <- mean((avo_test$AveragePrice - yhat_test_stepwise1)^2)
MSE_test_stepwise1
```
However, the test MSE is still far too large, so we decided to remove predictors one by one based on the summary. Trying "Area_Plains", "Area_FarWest", and "LargeBags" in turn, we found that removing "LargeBags" significantly reduces the test MSE to 0.1397455, the lowest we obtained. So the best-fitting model is `f1_2`.
```{r}
f1_2 <- as.formula(AveragePrice ~ month + type_conventional + TotalVolume +
                     PLU4046 + PLU4770 + PLU4225 + SmallBags + XLargeBags +
                     Area_NewEngland + Area_Mideast + Area_RockyMountain +
                     Area_FarWest + Area_GreatLakes + Area_Southwest + Area_Plains)
regfit.bwd2 = regsubsets(f1_2, data = avo_train, nvmax = 19, method = "backward")
summary(regfit.bwd2)

sub_model2<-lm(f1_2, data = avo_train)
yhat_train_stepwise2 <- predict(sub_model2, avo_train)
MSE_train_stepwise2 <- mean((avo_train$AveragePrice - yhat_train_stepwise2)^2)
MSE_train_stepwise2
yhat_test_stepwise2 <- predict(sub_model2, avo_test)
MSE_test_stepwise2 <- mean((avo_test$AveragePrice - yhat_test_stepwise2)^2)
MSE_test_stepwise2
```
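The drop-one search described above can also be automated. A sketch using a hypothetical helper `test_mse_without` (assuming `avo_train`, `avo_test`, and the formula `f1_1` from earlier chunks exist):
```{r}
# Hypothetical helper: refit the linear model with one predictor dropped
# and report the resulting test MSE.
test_mse_without <- function(drop_var, base_formula) {
  f <- update(base_formula, as.formula(paste(". ~ . -", drop_var)))
  m <- lm(f, data = avo_train)
  mean((avo_test$AveragePrice - predict(m, avo_test))^2)
}

# Compare test MSE after dropping each candidate predictor in turn
sapply(c("Area_Plains", "Area_FarWest", "LargeBags"),
       test_mse_without, base_formula = f1_1)
```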

Calculate the predicted price for the `avo` dataset, which contains both the train and test sets, and merge the predictions into a single data frame.
```{r}
yhat_avo_avgprice <- predict(sub_model2, avo)
df1_bwd<-avo %>% 
  select(Date, AveragePrice)
df2_bwd<-cbind(df1_bwd, yhat_avo_avgprice)
colnames(df2_bwd)
names(df2_bwd)[3]<-"AveragePrice_hat"
```

Plot the actual and predicted average prices. As the graph below shows, the backward-selection prediction is very similar to those from ridge and lasso.
```{r}
plot1_bwd <- df2_bwd %>% 
  group_by(Date) %>% 
  summarize(
    MeanAvg=mean(AveragePrice),
    MeanAvg_hat=mean(AveragePrice_hat))%>% 
    ggplot()+
    geom_line(aes(Date, MeanAvg),color = "#356211")+
    geom_line(aes(Date, MeanAvg_hat), color = "#cda989")+
    theme_classic()

plot1_bwd + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                   linetype = "dashed", size = 1,
                   color = "#AA471F")
```

* ### Ridge Regression ###

Run the ridge model with cross-validation.
```{R}
fit_ridge <- cv.glmnet(x1_train, y1_train, alpha = 0, nfolds = 100)
fit_ridge$lambda
```

Predict response
```{R}
yhat_train_ridge <- predict(fit_ridge, x1_train, s = fit_ridge$lambda)
yhat_test_ridge <- predict(fit_ridge, x1_test, s = fit_ridge$lambda)

mse_train_ridge <- vector()
mse_test_ridge <- vector()
# Compute the train and test MSE for every lambda on the path
for (i in 1:length(fit_ridge$lambda)) {
  mse_train_ridge[i] <- mean((y1_train - yhat_train_ridge)[,i]^2)
  mse_test_ridge[i] <- mean((y1_test - yhat_test_ridge)[,i]^2)
}
mse_train_ridge
mse_test_ridge
min(mse_train_ridge)
min(mse_test_ridge)

```

```{R}
lambda_min_mse_train_ridge<- fit_ridge$lambda[which.min(mse_train_ridge)]
lambda_min_mse_test_ridge <-fit_ridge$lambda[which.min(mse_test_ridge)]
lambda_min_mse_train_ridge
lambda_min_mse_test_ridge

yhat_train_ridge <- predict(fit_ridge, x1_train, s = lambda_min_mse_train_ridge)
yhat_test_ridge <- predict(fit_ridge, x1_test, s = lambda_min_mse_test_ridge)
yhat_1<- predict(fit_ridge, x1_avo, s = lambda_min_mse_test_ridge)
```

Aggregate data into one dataframe
```{R}
p1<-avo %>%
  select(Date, AveragePrice)
p2<-cbind(p1, yhat_1)
```

Plot
```{R}
names(p2)[3] <- "pre"
p2 %>%
  group_by(Date) %>%
  summarise(meanpriced = mean(AveragePrice),meanpre = mean(pre))%>%
  ggplot()+
  geom_line(mapping = aes(x=Date,
                           y=meanpriced), col = "#356211")+
  geom_line(mapping = aes(x=Date, y= meanpre), col = "#cda989")+
  geom_vline(xintercept=as.numeric(as.Date("2017-03-01")), col = "#AA471F", linetype = "dashed") + 
  labs(title = "Ridge Regression",
       y = "Mean Price")
```
The ridge model performs almost the same as the lasso model, so after building the lasso model we will compare the two to see which gives the more accurate prediction for our dataset.


* ### Lasso Regression ###

Run lasso model
```{r}
lmodel <- glmnet(x1_train, y1_train, alpha = 1, nlambda = 100)
lmodel$lambda
```
Predict the response. The minimum MSE is 0.06472616 for the train set and 0.1390842 for the test set.
```{r}
y1_train_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_train)
y1_test_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_test)

mse_train = vector()
mse_test = vector()

for (i in 1:length(lmodel$lambda)) {
  mse_train[i] <- mean((y1_train - y1_train_hat)[,i]^2)
  mse_test[i] <- mean((y1_test - y1_test_hat)[,i]^2)
}

min(mse_train)
min(mse_test)
```
Check the MSE-minimizing lambda for the train and test sets.
```{r}
lambda_min_mse_train_lasso<- lmodel$lambda[which.min(mse_train)]
lambda_min_mse_test_lasso <-lmodel$lambda[which.min(mse_test)]
lambda_min_mse_train_lasso
lambda_min_mse_test_lasso
```
Use the cross-validation function to find the best lambda. The result shows that the best lambda for the training set is the same as above.
```{r}
set.seed(1)
cv.out = cv.glmnet(x1_train, y1_train, alpha = 1)
## plot the lambda
plot(cv.out)
##check the best lambda
bestlam = cv.out$lambda.min
bestlam ## the best lamdba for training dataset is same as lambda_min_mse_train
```
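Besides `lambda.min`, `cv.glmnet` also reports `lambda.1se`, the largest lambda whose CV error is within one standard error of the minimum. It is a more conservative choice that yields a sparser model, and may be worth comparing (assuming the `cv.out` object from the chunk above):
```{r}
cv.out$lambda.min
cv.out$lambda.1se
# Coefficients at the sparser one-standard-error lambda
coef(cv.out, s = "lambda.1se")
```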
Check the coefficients of the f1 model. The result suggests some predictors may be eliminated: "TotalVolume", "PLU4225", "SmallBags", and "Area_TotalUS".
```{r}
f1coef<-coef(lmodel, s = lambda_min_mse_test_lasso)
f1coef
```

Create a new formula without the eliminated predictors.
```{r}
f2 <- as.formula(AveragePrice ~ month + type_conventional + type_organic + 
                   PLU4046 + PLU4770 + LargeBags + XLargeBags + 
                   Area_NewEngland + Area_Southeast + Area_Mideast + 
                   Area_RockyMountain + Area_GreatLakes + 
                   Area_Southwest + Area_Plains)
x2_train <- model.matrix(f2,avo_train)[,-1]
x2_test <- model.matrix(f2, avo_test)[,-1]
```
Run lasso model again with new predictors
```{r}
lmodel2 <- glmnet(x2_train, y1_train, alpha = 1, nlambda = 100)
```
Predict response with new predictors
```{r}
y2_train_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_train)
y2_test_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_test)
```
Compute the MSE again with the new predictors. Eliminating those predictors increases the train MSE slightly (0.06482449) but reduces the test MSE slightly (0.1390709).
```{r}
mse_train1 = vector()
mse_test1 = vector()

for (i in 1:length(lmodel2$lambda)) {
  mse_train1[i] <- mean((y1_train - y2_train_hat)[,i]^2)
  mse_test1[i] <- mean((y1_test - y2_test_hat)[,i]^2)
}

min(mse_train1)
min(mse_test1)
```
Check the new MSE-minimizing lambda for the train and test sets.
```{r}
lambda_min_mse_train1<- lmodel2$lambda[which.min(mse_train1)]
lambda_min_mse_test1 <-lmodel2$lambda[which.min(mse_test1)]
lambda_min_mse_train1
lambda_min_mse_test1
```
Check the coefficients for f2; no predictors are shrunk to zero anymore.
```{r}
f2coef<-coef(lmodel2, s = lambda_min_mse_test1)
f2coef
```
Since the second model has the lowest test MSE, we use `lmodel2` to make the final prediction.
```{r}
x2_avo <- model.matrix(f2, avo)[,-1]
y2_avo_min_lambda_hat <- predict(lmodel2, s = lambda_min_mse_test1, newx = x2_avo)
class(y2_avo_min_lambda_hat)
```
Aggregate data into one dataframe for the first model prediction
```{r}
df1<-avo %>% 
  select(Date, AveragePrice)
df2<-cbind(df1, y2_avo_min_lambda_hat)

names(df2)[3]<-"AveragePrice_hat"
```
Plot the actual average price and the predictive average price
```{r}
head(df2)
class(df2$Date)
plot1 <- df2 %>% 
  group_by(Date) %>% 
  summarize(
    MeanAvg=mean(AveragePrice),
    MeanAvg_hat=mean(AveragePrice_hat)) %>%
  ggplot()+
  geom_line(aes(Date, MeanAvg),color = "#356211")+
  geom_line(aes(Date, MeanAvg_hat), color = "#cda989")+
  theme_classic()

plot1 + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                   linetype = "dashed", size = 1,
                   color = "#AA471F")
```
We can see that our ridge model actually did a better job than lasso. Ridge is more flexible than lasso in that it shrinks coefficients without dropping variables entirely; if every variable in our dataset carries some signal, that would explain why ridge works better here.
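Since ridge and lasso perform so similarly, an elastic net (mixing the two penalties via `alpha` between 0 and 1) is a natural compromise. A minimal sketch we did not pursue, assuming the `x1_train`/`x1_test` matrices and `y1_train`/`y1_test` from earlier chunks:
```{r}
# alpha = 0.5 weights the ridge and lasso penalties equally
fit_enet <- cv.glmnet(x1_train, y1_train, alpha = 0.5)
yhat_test_enet <- predict(fit_enet, newx = x1_test, s = "lambda.min")
mean((y1_test - yhat_test_enet)^2)
```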

* ### Decision Tree ###

We use `rpart` function to build decision trees. 
```{r}
fit.tree <- rpart(f1,
                  avo_train,
                  control = rpart.control(cp = 0.001))

par(xpd = TRUE)
plot(fit.tree, compress=TRUE)
text(fit.tree, use.n=TRUE)
rpart.plot(fit.tree, type = 1)
```

```{r}
yhat.train.tree <- predict(fit.tree, avo_train)
mse.train.tree <- mean((avo_train$AveragePrice - yhat.train.tree)^2)
mse.train.tree
```

```{r}
yhat.test.tree <- predict(fit.tree, avo_test)
mse.test.tree <- mean((avo_test$AveragePrice - yhat.test.tree)^2)
mse.test.tree
```
The tree is difficult to read even after adjusting the plot type. The test MSE is 0.1468156, which is greater than lasso's. Decision trees make little sense for predicting prices, and we trained this one mainly for practice.

* ### Regression Tree ###

We use `tree` function to create regression trees.
```{r}
tree.avo <- tree(f1, avo_train)
summary(tree.avo)
```

```{r}
cv.avo <- cv.tree(tree.avo)

prune.avo <- prune.tree(tree.avo, best = 5)
plot(prune.avo)
text(prune.avo, pretty = 0)
```
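Instead of fixing `best = 5`, the pruning size can be read off the cross-validation results directly. A sketch, assuming the `cv.avo` and `tree.avo` objects from the chunk above:
```{r}
# Tree size with the lowest cross-validated deviance
best_size <- cv.avo$size[which.min(cv.avo$dev)]
best_size
prune.tree(tree.avo, best = best_size)
```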


```{r}
yhat <- predict(tree.avo, newdata = avo_test)
mse_regreTree_test <- mean((yhat - avo_test$AveragePrice)^2)
mse_regreTree_test
```
The terminal nodes show the price per avocado in each branch. The test MSE is 0.1499934, even greater than the decision tree's.

* ### Bagging ###

We use `randomForest` function to build bagging trees by setting `mtry` to the number of predictors in our base function f1. 
```{r}
set.seed(1)
bag.avo <- randomForest(f1, data = avo_train,
                        mtry = 19, importance = TRUE)
bag.avo
```

Then we create a scatter plot of test prediction and test actual prices, and add a trend line in 45 degrees. 
```{r}
yhat.bag = predict (bag.avo, newdata = avo_test)
plot(yhat.bag, avo_test$AveragePrice, main = "Scatter Plot for Bagging",
     xlab = "Prediction price in test set", ylab = "Average price in test set", 
     col = "#bdcc64")
abline (0,1)

mse_bag_test <- mean((yhat.bag - avo_test$AveragePrice)^2)
mse_bag_test
```
As we can see from this graph, the points are clustered above the trend line; ideally, they should lie on it. The test MSE is 0.1350875, which serves as the baseline for the other tree models.

* ### Random Forests ###

Building the model first:

```{r}
fit_rf <- randomForest(f1,
                       avo_train,
                       ntree=300,
                       do.trace=F)
varImpPlot(fit_rf)
```
Predicting our y_hats for train and for test data:

```{r}
yhat_rf_train <- predict(fit_rf, avo_train)
yhat_rf_test <- predict(fit_rf, avo_test)
```

Calculating MSE for both train and test:

```{r}
mse_rf_train <- mean((yhat_rf_train - y1_train) ^ 2)
mse_rf_test <- mean((yhat_rf_test - y1_test)^2)

print(mse_rf_train)
print(mse_rf_test)
```

Plotting actual prices and predicted prices on one plot:
Have to prepare data to use it for the plot first by creating a dataframe with Date, AveragePrice and Predicted Price (y_hat) for both train and test:

```{r}
avo %>% 
  select(Date, AveragePrice) -> plot_rf_full

yhat_rf_full <- c(yhat_rf_train,yhat_rf_test )

cbind(plot_rf_full, yhat_rf_full) -> plot_rf_full1  
```



```{r}
plot_rf_full1 %>% 
  group_by(Date) %>% 
  summarise(mean_y = mean(AveragePrice),
            mean_yhat = mean(yhat_rf_full)) %>% 
  ggplot() +
  geom_line(aes(x = Date, y = mean_y), col = "#356211") +
  geom_line(aes(x = Date, y = mean_yhat), col = "#cda989") -> rf_graph
 
rf_graph + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                   linetype = "dashed", size = 1,
                   color = "#AA471F") + 
  labs(title = "Random Forest",
       y = "Mean Price")
```
As we can see on the graph above, with 300 trees the random forest fits the train data well but performs worse on the test data. Nevertheless, the fit looks better than that of the less flexible models we used.

We could have increased the number of trees to check for better estimates, but unfortunately the RStudio instances we used could not handle random forests with more than 300 trees.
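Before spending compute on more trees, the out-of-bag (OOB) error curve can tell us whether 300 trees is already enough: if the curve has flattened, adding trees would change little. A quick check, assuming the fitted `fit_rf` object from above:
```{r}
# plot.randomForest draws the OOB MSE against the number of trees
plot(fit_rf, main = "OOB error vs. number of trees")
```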




* ### Boosting ###

We use the `gbm` function to fit boosted trees. We set the number of trees to 150, half the number used in the random forest, with `interaction.depth` set to 4 and `shrinkage` to 0.1. We also pass the `cv.folds` argument to apply 10-fold cross-validation during training.
```{r}
boostingcv <- gbm(f1,
                  data = avo_train,
                  distribution = "gaussian",
                  n.trees = 150,
                  interaction.depth = 4,
                  cv.folds = 10,
                  shrinkage = 0.1)
relative.influence(boostingcv)

yhat_btree <- predict(boostingcv, avo_train, n.trees = 150)
mse_btree <- mean((yhat_btree - y1_train) ^ 2)
print(mse_btree)
```
Here we get the train MSE for boosting. Then we calculate the test MSE.


```{r}
yhat_btree_test <- predict(boostingcv, avo_test, n.trees = 150)
mse_btree_test <- mean((yhat_btree_test - y1_test) ^ 2)
print(mse_btree_test)
```
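Because we ran `gbm` with `cv.folds = 10`, the `gbm.perf` function can estimate the optimal number of trees from the cross-validation error curve, which may improve on the fixed 150. A sketch, assuming the `boostingcv` model from above:
```{r}
# Best iteration according to the CV error curve (also draws the curve)
best_iter <- gbm.perf(boostingcv, method = "cv")
yhat_cv <- predict(boostingcv, avo_test, n.trees = best_iter)
mean((yhat_cv - avo_test$AveragePrice)^2)
```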

Next, we add the boosting predictions into train and test sets to plot the trend for predictions and actuals.
```{r}
avo_train$prediction_btree <- yhat_btree
avo_test$prediction_btree <- yhat_btree_test

avo_plot <- rbind(avo_train, avo_test)
```

```{r}
btree_plot <- avo_plot %>% 
  group_by(Date) %>% 
  summarise(meanAvg = mean(AveragePrice),
            meanAvg_hat = mean(prediction_btree)) %>% 
  ggplot() +
  geom_line(aes(Date, meanAvg), col = "#356211") + 
  geom_line(aes(Date, meanAvg_hat), col = "#cda989") + 
  labs(title = "Boosting",
       y = "Mean Price") +
  theme_clean()

btree_plot + 
  geom_vline(aes(xintercept = as.numeric(Date[113])),
             linetype = "dashed", size = 1, col = "#AA471F")
```
The boosting model captures the seasonality well, though its test MSE is slightly higher than the random forest's. At the end of this step, we prefer the random forest as our model of choice.

* ### Lasso with a New Variable ###
```{r}
ds1 <- read_csv("avocado3.csv")
glimpse(ds1)
```
The new dataset, `avocado3`, was created from the original avocado dataset. We added two columns, "NewPrice" and "NewPrice2", derived from the existing "AveragePrice": "NewPrice" is the AveragePrice from one row above, and "NewPrice2" is the AveragePrice from two rows above. We add these two lagged columns as new variables to make a more accurate prediction on this time-series dataset.
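The same lagged columns could also be generated in R with `dplyr::lag`, grouping by type and region so that prices never lag across series boundaries. A sketch (our `avocado3.csv` was prepared outside R, so this is illustrative only):
```{r}
ds1_lagged <- ds1 %>%
  group_by(type, region) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(NewPrice  = lag(AveragePrice, 1),   # price one row above
         NewPrice2 = lag(AveragePrice, 2)) %>%  # price two rows above
  ungroup()
```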

```{r}
ds1 <- ds1 %>% 
  group_by(type, region) %>% 
  select(X1, year, Date, type, region, everything()) %>% 
  arrange(Date)

colName <- names(ds1)
colName[1] <- "ID"
names(ds1) <- colName

ds1$ID <- seq(nrow(ds1))

ds1$month <- month(ds1$Date)
ds1 <- ds1 %>% 
  select(ID, year, month, everything())

ds1New <- dummy_cols(ds1, select_columns = "type") %>% 
  select(ID, year, month, region, type_conventional, type_organic, 
         everything(), -type)

ds1New$other_PLU <- ds1New$`Total Volume` - ds1New$`4046` - ds1New$`4225` - ds1New$`4770`

ds1New <- ds1New %>% 
  select(1:3, Date, everything())

uniqueRegion <- unique(ds1New$region)
uniqueRegion <- as.data.frame(uniqueRegion)
uniqueRegion$Area <- c(
  "NewEngland", "Southeast", "Mideast", "RockyMountain", "NewEngland",
  "Mideast", "FarWest", "Southeast", "GreatLakes", "GreatLakes",
  "GreatLakes", "Southwest", "RockyMountain", "GreatLakes", "GreatLakes",
  "GreatLakes", "Mideast", "NewEngland", "Southeast", "GreatLakes",
  "Southeast", "FarWest", "FarWest", "Southeast", "Southeast",
  "Southeast", "Southeast", "Southeast", "Mideast", "NewEngland",
  "NewEngland", "Southeast", "Mideast", "Southwest", "Mideast",
  "Plains", "FarWest", "Southeast", "Southeast", "Southeast",
  "FarWest", "FarWest", "FarWest", "FarWest", "Southeast",
  "Southeast", "Southeast", "FarWest", "Plains", "Mideast",
  "Southeast", "TotalUS", "FarWest", "Southwest"
)
names(uniqueRegion)[1] <- "region"

avo <- ds1New %>% 
  left_join(uniqueRegion, by = "region") %>% 
  select(1:5, Area, everything())
avo <- dummy_cols(avo, select_columns = "Area")

names(avo)[10] <- "TotalVolume"
names(avo)[14] <- "TotalBags"
names(avo)[15] <- "SmallBags"
names(avo)[16] <- "LargeBags"
names(avo)[17] <- "XLargeBags"
names(avo)[11] <- "PLU4046"
names(avo)[12] <- "PLU4225"
names(avo)[13] <- "PLU4770"

set.seed(1234)
avo_train <- avo %>% filter(as.Date(Date) < "2017-03-01")
avo_train %>%
  filter(year == 2017, month == 2)
avo_test <- avo %>% filter(as.Date(Date) >= "2017-03-01")
avo_test %>%
  filter(year == 2018, month == 3)
```

Prepare the model matrices.
```{r}
f2 <- as.formula(AveragePrice ~ month + type_conventional + type_organic + TotalVolume +
                   PLU4046 + PLU4770 + PLU4225 + SmallBags + LargeBags + XLargeBags +
                   Area_NewEngland + Area_Southeast + Area_Mideast + Area_RockyMountain +
                   Area_FarWest + Area_GreatLakes + Area_Southwest + Area_Plains +
                   Area_TotalUS + NewPrice + NewPrice2)

x1_train <- model.matrix(f2,avo_train)[,-1]
y1_train <- avo_train$AveragePrice
x1_test <- model.matrix(f2, avo_test)[,-1]
y1_test <- avo_test$AveragePrice
date_test <- avo_test$Date
x1_avo <- model.matrix(f2, avo)[,-1]
```

Run lasso
```{r}
lmodel <- glmnet(x1_train, y1_train, alpha = 1, nlambda = 100)
lmodel$lambda
```

Predict response
```{r}
y1_train_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_train)
y1_test_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_test)
length(y1_test_hat)
mse_train = vector()
mse_test = vector()

for (i in 1:length(lmodel$lambda)) {
  mse_train[i] <- mean((y1_train - y1_train_hat)[,i]^2)
  mse_test[i] <- mean((y1_test - y1_test_hat)[,i]^2)
}
mse_train
mse_test

min(mse_test)
```

```{r}
lambda_min_mse_train<- lmodel$lambda[which.min(mse_train)]
lambda_min_mse_test <-lmodel$lambda[which.min(mse_test)]
lambda_min_mse_train
lambda_min_mse_test
```


Use the cross-validation function to find the best lambda.
```{r}
set.seed(1)
cv.out = cv.glmnet(x1_train, y1_train, alpha = 1)
```

Plot the lambda
```{r}
plot(cv.out,
     xlab = "Log(lambda)")

```

Check the best lambda
```{r}
bestlam = cv.out$lambda.min
bestlam ## the best lamdba for training dataset is same as lambda_min_mse_train
```

Create a new formula without the eliminated predictors.
```{r}
f3 <- as.formula(AveragePrice ~ month + type_conventional + type_organic + 
                   PLU4046 + PLU4770 + PLU4225 + LargeBags + XLargeBags + 
                   Area_NewEngland + Area_Southeast + Area_Mideast + 
                   Area_RockyMountain + Area_GreatLakes + 
                   Area_Southwest + Area_Plains)
x2_train <- model.matrix(f3,avo_train)[,-1]
x2_test <- model.matrix(f3, avo_test)[,-1]
```

Run lasso model again with new predictors
```{r}
lmodel2 <- glmnet(x2_train, y1_train, alpha = 1, nlambda = 100)
```

Predict response with new predictors
```{r}
y2_train_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_train)
y2_test_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_test)
```

Compute the MSE again with the new predictors. The results show that the train MSE increases slightly when the uncorrelated predictors are eliminated, while the test MSE stays essentially the same.
```{r}
mse_train1 = vector()
mse_test1 = vector()

for (i in 1:length(lmodel2$lambda)) {
  mse_train1[i] <- mean((y1_train - y2_train_hat)[,i]^2)
  mse_test1[i] <- mean((y1_test - y2_test_hat)[,i]^2)
}
mse_train1
mse_test1
min(mse_train1)
min(mse_test1)

lambda_min_mse_train1<- lmodel2$lambda[which.min(mse_train1)]
lambda_min_mse_test1 <-lmodel2$lambda[which.min(mse_test1)]
lambda_min_mse_train1
lambda_min_mse_test1
```

Check coefficients for f2 and f3 model
```{r}
f2coef<-coef(lmodel, s = lambda_min_mse_test)
f3coef<-coef(lmodel2, s = lambda_min_mse_test1)
f2coef 
f3coef
```

Since the second model increases the training MSE, the first model remains the better choice, so we use "lmodel" to run the prediction.
```{r}
y1_avo_min_lambda_hat <- predict(lmodel, s = lambda_min_mse_test, newx = x1_avo)
class(y1_avo_min_lambda_hat)
```

Aggregate data into one dataframe for the first model prediction
```{r}
df2<-avo %>% 
  select(Date, AveragePrice)
df3<-cbind(df2, y1_avo_min_lambda_hat)
colnames(df3)

names(df3)[3]<-"AveragePrice_hat"
```

Plot the actual average price and the predictive average price
```{r}
head(df3)
class(df3$Date)
 df3 %>% 
  group_by(Date) %>%
  summarise(meanpriced = mean(AveragePrice),meanpre = mean(AveragePrice_hat))%>%
  ggplot()+
  geom_line(mapping = aes(x=Date,
                           y=meanpriced), col = "#356211")+
  geom_line(mapping = aes(x=Date, y= meanpre), col = "#cda989")+
  geom_vline(xintercept=as.numeric(as.Date("2017-03-01")), col = "#AA471F", linetype = "dashed", size = 1)+
   labs(title = "Lasso with new variables",
        y = "Mean Price") +
   theme_classic()
```
Compared with the regular lasso above, this model predicts our dataset much better: it captures not only the seasonal change in avocado prices but also some of the variation within each year. 
However, although this model has the lowest test MSE so far, the prediction still falls short of what we would like visually, so further work on this time-series dataset is needed for a more accurate prediction.



* ### Lasso with a previous day prediction as a variable ###

Because we are working with a time-series dataset, it is reasonable to assume that the previous day's price plays a big part in predicting the next day's price. 
Therefore, we tried to write code that predicts one day's price and uses it as a predictor for the next day's price.

Let's make a new data frame with the 1-day-lagged price as an additional variable, and set it to zero for the test data (in real life we would not know the previous day's price):
```{r}
avo_ts <- avo %>% 
          mutate(lag1_y = lag(AveragePrice))

dim(avo_ts)
avo_ts <- drop_na(avo_ts)
dim(avo_ts)

avo_train_ts <- avo_ts[1:12202, ]
avo_test_ts <- avo_ts[12203:18248, ]
dim(avo_test_ts)

avo_test_ts[2:6046,"lag1_y"] <- 0

```
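One caveat: `lag()` in the chunk above operates on adjacent rows of the full data frame, so at series boundaries a row can inherit the price of a different region or type. A grouped variant (a sketch we did not run) would avoid this:
```{r}
avo_ts_grouped <- avo %>%
  group_by(type_conventional, Area) %>%       # lag within each series only
  arrange(Date, .by_group = TRUE) %>%
  mutate(lag1_y = lag(AveragePrice)) %>%
  ungroup()
```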

Running a loop to predict each day using previous day prediction:

```{r}
f_ts <- as.formula(AveragePrice ~ lag1_y + month + type_conventional + type_organic + 
                           PLU4046 + PLU4770 + PLU4225 + LargeBags + XLargeBags + 
                           Area_NewEngland + Area_Southeast + Area_Mideast + 
                           Area_RockyMountain + Area_GreatLakes + 
                           Area_Southwest + Area_Plains)


y_hat_ts_test_f <- avo_test_ts[1,"lag1_y"]
avo_ts1 <- avo_ts
lambda_min_test <- 0.001570601 #the min lambda from lasso model above
check_ts <- vector()
for (i in 1:6045) {
  avo_train_ts <- avo_ts1[1:(12203+i), ]
  avo_test_ts <- avo_ts1[(12204+i):18248, ]
  
  x_ts_train <- model.matrix(f_ts,avo_train_ts)[,-1]
  x_ts_test <- model.matrix(f_ts, avo_test_ts)[,-1]
  
  y_ts_train <- avo_train_ts$AveragePrice
  y_ts_test <- avo_test_ts$AveragePrice
  
  lmodel_ts <- glmnet(x_ts_train, y_ts_train, alpha = 1, nlambda = 100)
  
  y_ts_train_hat <- predict(lmodel_ts, s = lambda_min_test, newx = x_ts_train)
  y_ts_test_hat <- predict(lmodel_ts, s = lambda_min_test, newx = x_ts_test)
  y_hat_ts_test_f <- c(y_hat_ts_test_f, y_ts_test_hat[1])
  (length(y_hat_ts_test_f))
  avo_ts1[(12204+i), "lag1_y"] <- y_ts_test_hat[1]
  
}
```
The loop returns an error which we unfortunately could not resolve (in part because the loop takes a long time to run).

However, it did manage to gather predictions that use the previous day's prediction as a variable, as we can see here:
```{r}
length(y_hat_ts_test_f)
y_hat_ts_test_f <- unlist(y_hat_ts_test_f)
```

The vector is supposed to contain 6046 numbers, but presumably because of the error, the loop lost 2 of them somewhere.
Since 2 out of 6046 will not change the result significantly, we substitute the 2 lost values with the mean of the known 6044 so that the MSE can be calculated properly:

```{r}
y_hat_ts_test_f <- c(y_hat_ts_test_f, mean(y_hat_ts_test_f), mean(y_hat_ts_test_f))
length(y_hat_ts_test_f)
```

We need to also readjust test data because of the loop changes:

```{r}
avo_test_ts <- avo_ts[12203:18248,]
y_ts_test <- avo_test_ts$AveragePrice
```

We now can calculate MSE:

```{r}
mse_ts_test_f <- mean((y_ts_test - y_hat_ts_test_f)^2)
print(mse_ts_test_f)
```
The MSE shown above is the lowest among all the models we have run. This may indicate that for a time-series dataset it is better to use the previous day's prediction as a variable for the next day's prediction.
However, it is quite inconvenient and time-consuming to adjust the model's variables and predict each row separately.

Let's see if plot can show the difference.

Need to select and prepare data for the plot first:
```{r}
avo %>% 
  select(Date, AveragePrice) -> plot_ts_full
plot_ts_full <- plot_ts_full[2:nrow(plot_ts_full),]

y_hat_train_ts <-  y_ts_train_hat[1:12202,]
y_hat_ts_full <- c(y_hat_train_ts, y_hat_ts_test_f )

plot_ts <- cbind(plot_ts_full, y_hat_ts_full)
```

Plot:
```{r}
plot_ts %>% 
  group_by(Date) %>% 
  summarise(mean_y = mean(AveragePrice),
            mean_yhat = mean(y_hat_ts_full)) %>% 
  ggplot() +
  geom_line(aes(x = Date, y = mean_y), col = "#356211") +
  geom_line(aes(x = Date, y = mean_yhat), col = "#cda989") -> ts_graph
 
ts_graph + geom_vline(aes(xintercept = as.numeric(Date[113])), 
                   linetype = "dashed", size = 1,
                   color = "#AA471F") + 
  labs(title = "Lasso with Loop",
       y = "Mean Price")
```
The graph does not show much of a difference, but we know this model is better because it has the lowest MSE.



----------

## Result ##

* ### Key to choose the best model ###

* ### The best one ###

* ### Challenges ###

----------

## Conclusions ##

* ### Influential Predictors ###

* ### Reflections ###
